A Tutorial: Designing Cluster Computers and High Performance Storage Architectures
HPC ASIA 2002, Bangalore, INDIA, December 16, 2002
Dheeraj Bhardwaj (IIT Delhi) and N. Seetharama Krishna (C-DAC, Pune)

Transcript
Page 1:

A Tutorial
Designing Cluster Computers and High Performance Storage Architectures

At HPC ASIA 2002, Bangalore, INDIA
December 16, 2002

By
Dheeraj Bhardwaj
Department of Computer Science & Engineering, Indian Institute of Technology, Delhi, INDIA
e-mail: [email protected]
http://www.cse.iitd.ac.in/~dheerajb

N. Seetharama Krishna
Centre for Development of Advanced Computing, Pune University Campus, Pune, INDIA
e-mail: [email protected]
http://www.cdacindia.com

Page 2:

Acknowledgments

• All the contributors of LINUX
• All the contributors of Cluster Technology
• All the contributors in the art and science of parallel computing
• Department of Computer Science & Engineering, IIT Delhi
• Centre for Development of Advanced Computing (C-DAC) and collaborators

Page 3:

Disclaimer

• The information and examples provided are based on a Red Hat Linux 7.2 installation on Intel PC platforms (our specific hardware configuration)
• Much of it should be applicable to other versions of Linux
• There is no warranty that the materials are error free
• The authors will not be held responsible for any direct, indirect, special, incidental or consequential damages related to any use of these materials

Page 4:

Part – I

Designing Cluster Computers

Page 5:

Outline

• Introduction
• Classification of Parallel Computers
• Introduction to Clusters
• Classification of Clusters
• Cluster Components Issues
  – Hardware
  – Interconnection Network
  – System Software
• Design and Build a Cluster Computer
  – Principles of Cluster Design
  – Cluster Building Blocks
  – Networking under Linux
  – PBS
  – PVFS
  – Single System Image

Page 6:

Outline

• Tools for Installation and Management
  – Issues related to installation, configuration, monitoring and management
  – NPACI Rocks
  – OSCAR
  – EU DataGrid WP4 Project
  – Other tools: Scyld Beowulf, OpenMosix, Cplant, SCore
• HPC Applications and Parallel Programming
  – HPC Applications
  – Issues related to parallel programming
  – Parallel Algorithms
  – Parallel Programming Paradigms
  – Parallel Programming Models
  – Message Passing
  – Application I/O and Parallel File Systems
  – Performance metrics of Parallel Systems
• Conclusion

Page 7:

Introduction

Page 8:

What do We Want to Achieve ?

• Develop High Performance Computing (HPC) infrastructure which is
  – Scalable (Parallel, MPP, Grid)
  – User friendly
  – Based on Open Source
  – Efficient in problem solving
  – Able to achieve high performance
  – Able to handle large data volumes
  – Cost effective
• Develop HPC applications which are
  – Portable (Desktop, Supercomputers, Grid)
  – Future proof
  – Grid ready

Page 9:

Who Uses HPC ?

• Scientific & Engineering applications
  – Simulation of physical phenomena
  – Virtual prototyping (modeling)
  – Data analysis
• Business / Industry applications
  – Data warehousing for financial sectors
  – E-governance
  – Medical imaging
  – Web servers, search engines, digital libraries
  – etc.
• All face similar problems
  – Not enough computational resources
  – Remote facilities: the network becomes the bottleneck
  – Heterogeneous and fast-changing systems

Page 10:

HPC Applications

• Three types
  – High-Capacity: Grand Challenge applications
  – Throughput: running hundreds or thousands of jobs, doing parameter studies, statistical analysis, etc.
  – Data: genome analysis, particle physics, astronomical observations, seismic data processing, etc.
• We are seeing a fundamental change in HPC applications
  – They have become multidisciplinary
  – They require an incredible mix of various technologies and expertise

Page 11:

Why Parallel Computing ?

• What if your application requires more computing power than a sequential computer can provide?
  – You might suggest improving the operating speed of the processor and other components
  – We do not disagree with your suggestion, BUT how far can you go?
• We always have the desire and prospects for greater performance

Parallel Computing is the right answer

Page 12:

Serial and Parallel Computing

SERIAL COMPUTING: Fetch/Store, Compute

PARALLEL COMPUTING: Fetch/Store, Compute/Communicate, Cooperative game

A parallel computer is a "collection of processing elements that communicate and co-operate to solve large problems fast".

Page 13:

Classification of Parallel Computers

Page 14:

Classification of Parallel Computers

Flynn Classification: number of instruction and data streams
  – SISD: conventional (sequential) computers
  – SIMD: data parallel, vector computing
  – MISD: systolic arrays
  – MIMD: very general, multiple approaches

Page 15:

MIMD Architecture: Classification

• MIMD
  – Non-shared (distributed) memory
    • MPP
    • Clusters
  – Shared memory
    • Uniform memory access: PVP, SMP
    • Non-uniform memory access: CC-NUMA, NUMA, COMA

Current focus is on the MIMD model, using general-purpose processors or multicomputers.

Page 16:

MIMD: Shared Memory Architecture

• Source PE writes data to the global memory and the destination PE retrieves it
• Easy to build
• Limitation: reliability and expandability. A memory component or any processor failure affects the whole system.
• Increasing the number of processors leads to memory contention.
• Example: Silicon Graphics supercomputers

[Figure: processors connected through memory buses to a global memory]

Page 17:

MIMD: Distributed Memory Architecture

• Inter-process communication uses a high-speed network
• The network can be configured in various topologies, e.g. tree, mesh, cube
• Unlike shared-memory MIMD:
  – easily/readily expandable
  – highly reliable (any CPU failure does not affect the whole system)

[Figure: processors, each with a local memory, connected by a high-speed interconnection network]

Page 18:

MIMD Features

• The MIMD architecture is more general purpose
• MIMD needs clever use of the synchronization that comes from message passing to prevent race conditions
• Designing efficient message-passing algorithms is hard, because the data must be distributed in a way that minimizes communication traffic
• The cost of message passing is very high
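To make the message-passing style concrete, here is a minimal, hedged MPI sketch in C: rank 0 explicitly sends its locally owned slice of data to rank 1, so both the data distribution and the communication are visible in the code. The array name and sizes are arbitrary choices for this illustration.

    /* Minimal MPI example of explicit message passing between two processes.
       Build with an MPI C compiler (e.g. mpicc) and run with 2 processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, n = 8;
        double local[8];                       /* each process owns its own slice */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (i = 0; i < n; i++) local[i] = i;                    /* produce the data */
            MPI_Send(local, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* explicit communication */
        } else if (rank == 1) {
            MPI_Recv(local, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %g ... %g\n", local[0], local[n - 1]);
        }

        MPI_Finalize();
        return 0;
    }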

Page 19:

Shared Memory (Address-Space) Architecture

• Non-uniform memory access (NUMA): a shared-address-space computer with local and global memories
• The time to access a remote memory bank is longer than the time to access a local word
• Shared-address-space computers have a local cache at each processor to increase their effective processor bandwidth
• The cache can also be used to provide fast access to remotely located shared data
• Mechanisms have been developed for handling the cache coherence problem

Page 20:

Shared Memory (Address-Space) Architecture

[Figure: processors with local memories and separate global memory modules, connected by an interconnection network]
Non-uniform memory access (NUMA) shared-address-space computer with local and global memories

Page 21:

Shared Memory (Address-Space) Architecture

[Figure: processor/memory pairs connected by an interconnection network]
Non-uniform memory access (NUMA) shared-address-space computer with local memory only

Page 22:

Shared Memory (Address-Space) Architecture

• Provides hardware support for read and write access by all processors to a shared address space
• Processors interact by modifying data objects stored in the shared address space
• MIMD shared-address-space computers are referred to as multiprocessors
• Uniform memory access (UMA): a shared-address-space computer with local and global memories
• The time taken by a processor to access any memory word in the system is identical

Page 23:

Shared Memory (Address-Space) Architecture

[Figure: processors and memory modules connected by an interconnection network]
Uniform memory access (UMA) shared-address-space computer

Page 24:

Uniform Memory Access (UMA)

• Parallel Vector Processors (PVPs)

• Symmetric Multiprocessors (SMPs)

Page 25:

Parallel Vector Processor

VP : Vector Processor

SM : Shared memory

Page 26:

Parallel Vector Processor

• Works well only for vector codes
• Scalar codes may not perform well
• Algorithms need to be completely rethought and re-expressed so that vector instructions are performed almost exclusively
• Special-purpose hardware is necessary
• The fastest systems are no longer vector uniprocessors

Page 27:

Parallel Vector Processor

• A small number of powerful custom-designed vector processors is used
• Each processor is capable of at least 1 Gflop/s performance
• A custom-designed, high-bandwidth crossbar switch networks these vector processors
• Most machines do not use caches; rather they use a large number of vector registers and an instruction buffer
• Examples: Cray C-90, Cray T-90, ...

Page 28:

P/C : Microprocessor and cache

SM : Shared memory

Symmetric Multiprocessors (SMPs)

Page 29:

Symmetric Multiprocessors (SMPs): Characteristics

• Use commodity microprocessors with on-chip and off-chip caches
• Processors are connected to a shared memory through a high-speed snoopy bus
• On some SMPs, a crossbar switch is used in addition to the bus
• Scalable up to:
  – 4-8 processors (non-backplane based)
  – a few tens of processors (backplane based)

Page 30:

Symmetric Multiprocessors (SMPs): Characteristics

• All processors see the same image of all system resources
• Equal priority for all processors (except for the master or boot CPU)
• Memory coherency is maintained by hardware
• Multiple I/O buses for greater input/output capacity

Page 31:

Symmetric Multiprocessors (SMPs)

[Figure: four processors, each with an L1 cache, connected through a DIR controller to memory and through an I/O bridge to the I/O bus]

Page 32:

Symmetric Multiprocessors (SMPs): Issues

• Bus-based architecture: inadequate beyond 8-16 processors
• Crossbar-based architecture: a multistage approach is needed, considering the I/Os required in hardware
• Clock distribution and high-frequency design issues for backplanes
• The limitation is mainly caused by using a centralized shared memory and a bus or crossbar interconnect, both of which are difficult to scale once built

Page 33:

Commercial Symmetric Multiprocessors (SMPs)

• Sun Ultra Enterprise 10000 (high end, expandable up to 64 processors), Sun Fire
• DEC AlphaServer 8400
• HP 9000
• SGI Origin
• IBM RS/6000
• IBM P690, P630
• Intel Xeon, Itanium, IA-64 (McKinley)

Page 34:

Symmetric Multiprocessors (SMPs)

• Heavily used in commercial applications (databases, on-line transaction systems)
• The system is symmetric (every processor has equal access to the shared memory, the I/O devices, and the operating system)
• Being symmetric, a higher degree of parallelism can be achieved

Page 35:

P/C : Microprocessor and cache; LM : Local memory; NIC : Network interface circuitry; MB : Memory bus

Massively Parallel Processors (MPPs)

Page 36:

Commodity microprocessors in processing nodes

Physically distributed memory over processing nodes

High communication bandwidth and low latency as an interconnect. (High-speed, proprietary communication network)

Tightly coupled network interface which is connected to the memory bus of a processing node

Massively Parallel Processors (MPPs)

Page 37:

Massively Parallel Processors (MPPs)

• Provide proprietary communication software to realize the high performance
• Processors are connected by a high-speed memory bus to a local memory and to a network interface circuitry (NIC)
• Scaled up to hundreds or even thousands of processors
• Each process has its own private address space; processes interact by passing messages

Page 38:

Massively Parallel Processors (MPPs)

• MPPs support asynchronous MIMD modes
• MPPs support a single system image at different levels
• Microkernel operating system on compute nodes
• Provide a high-speed I/O system
• Examples: Cray T3D, Cray T3E, Intel Paragon, IBM SP2

Page 39:

Cluster ?

A Cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand alone/complete computers cooperatively working together as a single, integrated computing resource.

clus·ter n. 1. A group of the same or similar elements gathered or occurring closely together; a bunch: "She held out her hand, a small tight cluster of fingers" (Anne Tyler). 2. Linguistics. Two or more successive consonants in a word, as cl and st in the word cluster.

Page 40:

Cluster System Architecture

[Figure: layered cluster architecture. At the top: the programming environment (Java, C, Fortran, MPI, PVM), web user interface and other subsystems (database, OLTP); beneath them: the single system image infrastructure and the availability infrastructure; at the bottom: the nodes, each running its own OS, tied together by the interconnect]

Page 41:

Clusters ?

• A set of nodes physically connected over a commodity or proprietary network
• Gluing software
• Other than this definition, no official standard exists; it depends on the user requirements
  – Commercial
  – Academic
  – A good way to sell old wine in a new bottle
  – Budget
  – Etc.
• Designing clusters is not obvious, but it is a critical issue

Page 42:

Why Clusters NOW?

• Clusters gained momentum when three technologies converged:
  – Very high performance microprocessors
    • workstation performance = yesterday's supercomputers
  – High-speed communication
  – Standard tools for parallel/distributed computing and their growing popularity
• Time to market => performance
• Internet services: huge demand for scalable, available, dedicated internet servers
  – big I/O, big computing power

Page 43:

How should we Design them ?

• Components
  – Should they be off-the-shelf and low cost?
  – Should they be specially built?
  – Is a mixture a possibility?
• Structure
  – Should each node be in a different box (workstation)?
  – Should everything be in a box?
  – Should everything be in a chip?
• Kind of nodes
  – Should it be homogeneous?
  – Can it be heterogeneous?

Page 44:

What Should it offer ?

• Identity
  – Should each node maintain its identity (and owner)?
  – Should it be a pool of nodes?
• Availability
  – How far should it go?
• Single-system image
  – How far should it go?

Page 45:

Place for Clusters in the HPC World?

[Figure: computing paradigms arranged by the distance between nodes. Shared-memory parallel computing: within a chip or a box; cluster computing: within a room; distributed and Grid computing: a building up to the world]

Source: Toni Cortes ([email protected])

Page 46:

Where Do Clusters Fit?

Distributed systems
• Gather (unused) resources
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared
• Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.

MP systems
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is the maximum
• Apps drive the purchase of equipment
• Real-time constraints
• Space-shared

[Figure: spectrum of systems from the Internet (SETI@home, Condor, Legion/Globus) through Berkeley NOW, Beowulf clusters and superclusters to ASCI Red/Tflops; 15 TF/s delivered vs. 1 TF/s delivered]

Src: B. Maccabe (UNM), R. Pennington (NCSA)

Page 47:

Top 500 Supercomputers

• From www.top500.org

Rank  Computer / Processors                              Peak performance  Country / Year
1     Earth Simulator (NEC) / 5120                       40960 GF          Japan / 2002
2     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096    10240 GF          LANL, USA / 2002
3     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096    10240 GF          LANL, USA / 2002
4     ASCI White (IBM) SP Power3 375 MHz / 8192          12288 GF          LLNL, USA / 2000
5     MCR Linux Cluster, Xeon 2.4 GHz, Quadrics / 2304   11060 GF          LLNL, USA / 2002

Page 48:

What makes the Clusters ?

• The same hardware is used for
  – Distributed computing
  – Cluster computing
  – Grid computing
• Software converts the hardware into a cluster
  – It ties everything together

Page 49:

Task Distribution

• The hardware is responsible for
  – High performance
  – High availability
  – Scalability (network)
• The software is responsible for
  – Gluing the hardware
  – Single-system image
  – Scalability
  – High availability
  – High performance

Page 50:

Classification of Cluster Computers

Page 51:

Clusters Classification 1

• Based on focus (in the market)
  – High performance (HP) clusters
    • Grand challenge applications
  – High availability (HA) clusters
    • Mission critical applications
    • Web / e-mail
    • Search engines

Page 52:

HA Clusters

Page 53:

Clusters Classification 2

• Based on workstation/PC ownership
  – Dedicated clusters
  – Non-dedicated clusters
    • Adaptive parallel computing
    • Can be used for CPU cycle stealing

Page 54:

Clusters Classification 3

• Based on node architecture
  – Clusters of PCs (CoPs)
  – Clusters of Workstations (COWs)
  – Clusters of SMPs (CLUMPs)

Page 55:

Clusters Classification 4

• Based on node components architecture & configuration:
  – Homogeneous clusters
    • All nodes have a similar configuration
  – Heterogeneous clusters
    • Nodes based on different processors and running different OSs

Page 56:

Clusters Classification 5

• Based on node OS type
  – Linux clusters (Beowulf)
  – Solaris clusters (Berkeley NOW)
  – NT clusters (HPVM)
  – AIX clusters (IBM SP2)
  – SCO/Compaq clusters (Unixware)
  – Digital VMS clusters, HP clusters, ...

Page 57:

Clusters Classification 6

• Based on levels of clustering:
  – Group clusters (# nodes: 2-99)
    • A set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet
  – Departmental clusters (# nodes: 99-999)
  – Organizational clusters (# nodes: many 100s)
  – Internet-wide clusters = global clusters (# nodes: 1000s to many millions)
    • Computational Grid

Page 58:

Clustering Evolution

[Figure: cost and complexity versus time, 1990 to 2005. 1st generation: MPP supercomputers; 2nd generation: Beowulf clusters; 3rd generation: commercial-grade clusters; 4th generation: network-transparent clusters]

Page 59:

Cluster Components

– Hardware
– System Software

Page 60:

Hardware

Page 61:

Nodes

• The idea is to use standard off-the-shelf processors
  – Pentium-class (Intel, AMD)
  – Sun
  – HP
  – IBM
  – SGI
• No special development for clusters

Page 62:

Interconnection Network

Page 63:

Interconnection Network

• One of the key points in clusters

• Technical objectives
  – High bandwidth
  – Low latency
  – Reliability
  – Scalability

Page 64:

Network Design Issues

• Plenty of work has been done to improve networks for clusters

• Main design issues
  – Physical layer
  – Routing
  – Switching
  – Error detection and correction
  – Collective operations

Page 65:

Physical Layer

• Trade-off between raw data transfer rate and cable cost
• Bit width
  – Serial media (*-Ethernet, Fiber Channel)
    • Moderate bandwidth
  – 64-bit wide cable (HIPPI)
    • Pin count limits the implementation of switches
  – 8-bit wide cable (Myrinet, ServerNet)
    • A good compromise

Page 66:

Routing

• Source-path routing
  – The entire path is attached to the message at its source location
  – Each switch deletes the current head of the path
• Table-based routing
  – The header only contains the destination node
  – Each switch has a table to help in the decision
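A hedged, purely illustrative C sketch of the per-switch decision under the two schemes follows; the header structures are invented for this example and do not correspond to any real interconnect's packet format.

    /* Illustrative only: how a switch picks the output port under each scheme. */
    #define MAX_HOPS 16

    /* Source-path routing: the message carries the whole path and each switch
       consumes the current head of that path. */
    struct sp_header { int hops_left; int path[MAX_HOPS]; };

    int sp_next_port(struct sp_header *h)
    {
        int i, port = h->path[0];              /* current head of the path */
        for (i = 1; i < h->hops_left; i++)     /* delete the head */
            h->path[i - 1] = h->path[i];
        h->hops_left--;
        return port;
    }

    /* Table-based routing: the header only names the destination and the switch
       looks the output port up in its own routing table. */
    struct tb_header { int dest; };

    int tb_next_port(const struct tb_header *h, const int table[], int n_nodes)
    {
        return (h->dest >= 0 && h->dest < n_nodes) ? table[h->dest] : -1;
    }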

Page 67:

Switching

• Packet switching
  – Packets are buffered in the switch before being resent
    • Implies an upper bound on packet size
    • Needs buffers in the switch
  – Used by traditional LAN/WAN networks
• Wormhole switching
  – Data is immediately forwarded to the next stage
    • Low latency
    • No buffers are needed
    • Error correction is more difficult
  – Used by SANs such as Myrinet and PARAMNet

Page 68:

Error Detection

• It has to be done at the hardware level
  – For performance reasons
  – e.g. CRC checking is done by the network interface
• Networks are very reliable
  – Only erroneous messages should see overhead

Page 69:

Collective Operations

• These operations are mainly
  – Barrier
  – Multicast
• Few interconnects offer this characteristic
  – Synfinity is a good example
• Normally offered by software
  – Easy to achieve in bus-based networks like Ethernet
  – Difficult to achieve in point-to-point networks like Myrinet
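Since collective operations are normally offered by software, the hedged C/MPI sketch below shows the two operations named above, barrier and multicast (broadcast), as an application sees them; on most clusters the MPI library builds them out of point-to-point messages. The broadcast value is arbitrary.

    /* Barrier and broadcast (multicast) as exposed by MPI. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) value = 42;                         /* the root owns the data */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* multicast to every rank */
        MPI_Barrier(MPI_COMM_WORLD);                       /* synchronize all ranks */
        printf("rank %d sees value %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }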

Page 70:

Examples of Network

• The most common networks used are
  – *-Ethernet
  – SCI
  – Myrinet
  – PARAMNet
  – HIPPI
  – ATM
  – Fiber Channel
  – AmpNet
  – etc.

Page 71:

*-Ethernet

• Most widely used for LANs
  – Affordable
  – Serial transmission
  – Packet switching and table-based routing
• Types of Ethernet
  – Ethernet and Fast Ethernet
    • Based on a collision domain (buses)
    • Switched hubs can create different collision domains
  – Gigabit Ethernet
    • Based on high-speed point-to-point switches
    • Each node is in its own collision domain

Page 72:

ATM

• Standard designed for the telecommunication industry
  – Relatively expensive
  – Serial
  – Packet switching and table-based routing
  – Designed around the concept of fixed-size packets
• Special characteristics
  – Well designed for real-time systems

Page 73:

Scalable Coherent Interface (SCI)

• First standard specially designed for CC
• Low layer
  – Point-to-point architecture that maintains bus functionality
  – Packet switching and table-based routing
  – Split transactions
  – Dolphin Interconnect Solutions, Sun SPARC Sbus
• High layer
  – Defines a distributed cache-coherent scheme
  – Allows transparent shared-memory programming
  – Sequent NUMA-Q, Data General AViiON NUMA

Page 74:

PARAMNet & Myrinet

• Low-latency and high-bandwidth networks
• Characteristics
  – Byte-wide links
  – Wormhole switching and source-path routing
    • Low-latency cut-through routing switches
    • Automatic mapping, which favors fault tolerance
    • Zero-copying is not possible
  – Programmable on-board processor
    • Allows experimentation with new protocols

Page 75:

Communication Protocols

• Traditional protocols
  – TCP and UDP
• Specially designed protocols
  – Active Messages
  – VMMC
  – BIP
  – VIA
  – etc.

Page 76:

Data Transfer

• User-level lightweight communication
  – Avoids OS calls
  – Avoids data copying
  – Examples: Fast Messages, BIP, ...
• Kernel-level lightweight communication
  – Simplified protocols
  – Avoids data copying
  – Examples: GAMMA, PM, ...

Page 77:

TCP and UDP

• First messaging libraries used
  – TCP is reliable
  – UDP is not reliable
• Advantages
  – Standard and well known
• Disadvantages
  – Too much overhead (especially for fast networks)
    • Plenty of OS interaction
    • Many copies
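To show where that overhead comes from, here is a hedged C sketch of a plain TCP send using the standard BSD socket calls: every call below is a system call, and send() copies the user buffer into kernel socket buffers before the network interface ever sees it, which is precisely what the lightweight protocols on the following slides try to avoid. The address and port are placeholders.

    /* Traditional TCP path: repeated kernel crossings and an extra data copy. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int send_over_tcp(const char *ip, int port, const void *buf, size_t len)
    {
        struct sockaddr_in addr;
        ssize_t sent;
        int fd = socket(AF_INET, SOCK_STREAM, 0);            /* system call */
        if (fd < 0) return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons((unsigned short)port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {   /* system call */
            close(fd);
            return -1;
        }
        sent = send(fd, buf, len, 0);                        /* system call + data copy */
        close(fd);                                           /* system call */
        return sent == (ssize_t)len ? 0 : -1;
    }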

Page 78:

Active Messages

• Low-latency communication library
• Main issues
  – Zero-copy protocol
    • Messages are copied directly to/from the network and to/from the user address space
    • The receiver's memory has to be pinned
  – There is no need for a receive operation
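The idea of dispatching without a posted receive can be sketched as follows; this is a conceptual, hedged illustration only, with an invented handler table and message layout, not the API of any real active-message library.

    /* Conceptual active-message dispatch (not a real AM API). */
    #include <stdio.h>

    typedef void (*am_handler)(const void *payload, int len);

    static void print_handler(const void *payload, int len)
    {
        printf("handler got %d bytes: %s\n", len, (const char *)payload);
    }

    /* Handler table agreed on by sender and receiver ahead of time. */
    static am_handler handlers[] = { print_handler };

    /* Called by the network layer when a message arrives: the header names the
       handler, so the payload is consumed without any matching receive call. */
    static void am_deliver(int handler_index, const void *payload, int len)
    {
        handlers[handler_index](payload, len);
    }

    int main(void)
    {
        const char msg[] = "hello from a remote node";
        am_deliver(0, msg, (int)sizeof msg);   /* simulate the arrival of one message */
        return 0;
    }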

Page 79:

VMMC

• Virtual-Memory Mapped Communication
  – Views messages as reads and writes to memory
    • Similar to distributed shared memory
  – Makes a correspondence between
    • a virtual page at the receiving side
    • a virtual page at the sending side

Page 80:

BIP

• Basic Interface for Parallelism
  – Low-level message layer for Myrinet
  – Uses various protocols for various message sizes
  – Tries to achieve zero copies (one at most)
• Used via MPI by programmers
  – 7.6 us latency
  – 107 Mbytes/s bandwidth

Page 81:

VIA

• Virtual Interface Architecture
  – First standard promoted by the industry
  – Combines the best features of academic projects
• Interface
  – Designed to be used by programmers directly
  – Many programmers believe it to be too low level
    • Higher-level APIs are expected
• NICs with VIA implemented in hardware
  – This is the proposed path

Page 82:

Potential and Limitations

• High bandwidth
  – Can be achieved at low cost
• Low latency
  – Can be achieved, but at high cost
  – The lower the latency, the closer we get to a traditional supercomputer
• Reliability
  – Can be achieved at low cost
• Scalability
  – Easy to achieve for the size of clusters

Page 83:

System Software

–Operating system vs. middleware

–Processor management

–Memory management

–I/O management

–Single-system image

–Monitoring clusters

–High Availability

–Potential and limitations

Page 84:

Operating system vs. Middleware

• Operating system
  – Hardware-control layer
• Middleware
  – Gluing layer
• The barrier between the two is not always clear
• Similar split at
  – User level
  – Kernel level

[Figure: middleware layered on top of the operating system]

Page 85:

System Software

• We will not distinguish between
  – Operating system
  – Middleware
• The middleware is related to the operating system
• Objectives
  – Performance / Scalability
  – Robustness
  – Single-system image
  – Extendibility
  – Heterogeneity

Page 86:

Processor Management

• Schedule jobs onto the nodes
  – Scheduling policies should take into account
    • Needed vs. available resources
      – Processors
      – Memory
      – I/O requirements
    • Execution-time limits
    • Priorities
  – Different kinds of jobs
    • Sequential and parallel jobs

Page 87:

Load Balancing

• Problem
  – A perfect static balance is not possible
    • The execution time of jobs is unknown
  – Unbalanced systems may not be efficient
• Solution
  – Process migration
    • Prior to execution
      – Granularity must be small
    • During execution
      – The cost must be evaluated

Page 88:

Fault Tolerance

• A large cluster must be fault tolerant
  – The probability of a fault is quite high
• Solutions
  – Re-execution of the applications of the failed node
    • Not always possible or acceptable
  – Checkpointing and migration
    • May have a high overhead
• Difficult with some kinds of applications
  – Applications that modify the environment
    • Transactional behavior may be a solution

Page 89:

Managing Heterogeneous Systems

• Compatible nodes but with different characteristics
  – It becomes a load-balancing problem
• Non-compatible nodes
  – Binaries for each kind of node are needed
  – Shared data has to be in a compatible format
  – Migration becomes nearly impossible

Page 90:

Scheduling Systems

• Kernel level
  – Very few take care of cluster scheduling
• High-level applications do the scheduling
  – Distribute the work
  – Migrate processes
  – Balance the load
  – Interact with the users
  – Examples: CODINE, CONDOR, NQS, etc.

Page 91:

Memory Management

• Objective
  – Use all the memory available in the cluster
• Basic approaches
  – Software distributed shared memory
    • General purpose
  – Specific usage of idle remote memory
    • Specific purpose
    • Remote memory paging
    • File-system caches or RAMdisks (described later)

Page 92:

Software Distributed Shared Memory

• Software layer
  – Allows applications running on different nodes to share memory regions
  – Relatively transparent to the programmer
• Address-space structure
  – Single address space
    • Completely transparent to the programmer
  – Shared areas
    • Applications have to mark a given region as shared
    • Not completely transparent
    • The approach mostly used, due to its simplicity

Page 93:

Main Data Problems to be Solved

• Data consistency vs. performance
  – A strict semantic is very inefficient
    • Current systems offer relaxed semantics
• Data location (finding the data)
  – The most common solution is the owner node
    • This node may be fixed or may vary dynamically
• Granularity
  – Usually a fixed block size is implemented
    • Hardware MMU restrictions
    • Leads to "false sharing"
    • Variable granularity is being studied
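The false-sharing point can be made concrete with a small hedged C example: under page-granularity sharing, two logically unrelated counters that happen to sit on the same page force the whole page to bounce between nodes even though no datum is actually shared. The page-size constant is an assumption for the illustration.

    /* Illustration of false sharing under page-granularity software DSM. */
    #define PAGE_SIZE 4096   /* assumed granularity used by the SDSM layer */

    /* Both counters fall on the same page, so a write by node A to counter_a and
       a write by node B to counter_b invalidate each other's copy of the page. */
    struct shared_page {
        long counter_a;      /* only node A ever writes this */
        long counter_b;      /* only node B ever writes this */
    };

    /* A common remedy: pad so that each item occupies its own page/block. */
    struct padded_counter {
        long counter;
        char pad[PAGE_SIZE - sizeof(long)];
    };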

Page 94:

Other Problems to be Solved

• Synchronization
  – Test-and-set-like mechanisms cannot be used
  – SDSM systems have to offer new mechanisms
    • e.g. semaphores (message-passing implementation)
• Fault tolerance
  – Very important and very seldom implemented
  – Multiple copies
• Heterogeneity
  – Different page sizes
  – Different data-type implementations
    • Use tags

Page 95:

Remote-Memory Paging

• Keep swapped-out pages in idle memory
  – Assumptions
    • Many workstations are idle
    • Disks are much slower than remote memory
  – Idea
    • Send swapped-out pages to idle workstations
    • When there is no remote memory space, use the disks
    • Replicate copies to increase fault tolerance
• Examples
  – The Global Memory Service (GMS)
  – Remote memory pager

Page 96:

I/O Management

• Advances very closely with parallel I/O
  – There are two major differences
    • Network latency
    • Heterogeneity
• Interesting issues
  – Network configurations
  – Data distribution
  – Name resolution
  – Using memory to increase I/O performance

Page 97:

Network Configurations

• Device location
  – Attached to nodes
    • Very easy to have (use the disks in the nodes)
  – Network-attached devices
    • I/O bandwidth is not limited by memory bandwidth
• Number of networks
  – Only one network for everything
  – One special network for I/O traffic (SAN)
    • Becoming very popular

Page 98:

Data Distribution

• Distribution per files
  – Each node has its own independent file system
    • As in distributed file systems (NFS, Andrew, CODA, ...)
  – Each node keeps a set of files locally
    • It allows remote access to its files
  – Performance
    • Maximum performance = device performance
    • Parallel access only to different files
    • Access to remote files depends on the network
      – Caches help but increase complexity (coherence)
  – Fault tolerance
    • File replication on different nodes

Page 99:

Data Distribution

• Distribution per blocks
  – Also known as software/parallel RAIDs
    • xFS, Zebra, RAMA, ...
  – Blocks are interleaved among all disks
  – Performance
    • Parallel access to blocks in the same file
    • Parallel access to different files
    • Requires a fast network
      – Usually solved with a SAN
    • Especially good for large requests (multimedia)
  – Fault tolerance
    • RAID levels (3, 4 and 5)
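A hedged sketch of the block interleaving described above: with round-robin striping, block b of a file lands on disk b mod N at local block b / N, which is what allows consecutive blocks of one file to be read in parallel. The function is a simplification that ignores parity (RAID 3, 4, 5) and variable stripe widths.

    /* Round-robin block striping: map a file block to (disk, block within disk). */
    struct stripe_loc { int disk; long local_block; };

    struct stripe_loc map_block(long file_block, int n_disks)
    {
        struct stripe_loc loc;
        loc.disk = (int)(file_block % n_disks);    /* which disk holds the block */
        loc.local_block = file_block / n_disks;    /* where it sits on that disk */
        return loc;
    }

    /* Example: with 4 disks, blocks 0..7 land on disks 0,1,2,3,0,1,2,3, so a
       large sequential read touches all disks at once. */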

Page 100:

Name Resolution

• The same as in distributed systems
  – Mounting remote file systems
    • Useful when the distribution is per files
  – Distributed name resolution
    • Useful when the distribution is per files
      – Returns the node where the file resides
    • Useful when the distribution is per blocks
      – Returns the node where the file's metadata is located

Page 101:

Caching

• Caching can be done at multiple levels
  – Disk controller
  – Disk servers
  – Client nodes
  – I/O libraries
  – etc.
• It is good to have several levels of cache
• High levels decrease the hit ratio of low levels
  – Higher-level caches absorb most of the locality

Page 102:

Cooperative Caching

• Problem of traditional caches
  – Each node caches the data it needs
    • Plenty of replication
    • Memory space not well used
• Increase the coordination of the caches
  – Clients know what other clients are caching
    • Clients can access cached data in remote nodes
  – Replication in the cache is reduced
    • Better use of the memory

Page 103:

RAMdisks

• Assumptions
  – Disks are slow and memory+network is fast
  – Disks are persistent and memory is not
• Build a "disk" unifying idle remote RAM
  – Only used for non-persistent data
    • Temporary data
  – Useful in many applications
    • Compilations
    • Web proxies
    • ...

Page 104:

Single-System Image

• SSI offers the illusion that the cluster is a single machine
• It can be done at several levels
  – Hardware
    • Hardware DSM
  – System software
    • Can offer a unified view to applications
  – Application
    • Can offer a unified view to the user
• All SSI implementations have a boundary

Page 105:

Key Services of SSI

• Main services offered by SSI
  – Single point of entry
  – Single file hierarchy
  – Single I/O space
  – Single point of management
  – Single virtual networking
  – Single job/resource management system
  – Single process space
  – Single user interface
• Not all of them are always available

Page 106:

Monitoring Clusters

• Clusters need tools to be monitored
  – Administrators have many things to check
  – The cluster must be visible from a single point
• Subjects of monitoring
  – Physical environment
    • Temperature, power, ...
  – Logical services
    • RPCs, NFS, ...
  – Performance meters
    • Paging, CPU load, ...

Page 107:

Monitoring Heterogeneous Clusters

• Monitoring is especially necessary in heterogeneous clusters
  – Several node types
  – Several operating systems
• The tool should hide the differences
  – The real characteristics are only needed to solve some problems
• Very related to the Single-System Image


Auto-Administration

• Monitors know how to perform self-diagnosis

• The next step is to run corrective procedures automatically (see the watchdog sketch below)
  – Some systems start to do so (NetSaint)
  – Difficult because tools do not have common sense

• This step is necessary
  – Many nodes
  – Many devices
  – Many possible problems
  – High probability of error
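A hedged sketch of the simplest form of auto-administration: a cron-driven watchdog that restarts a service if its daemon has died. The service chosen (NFS) and the init-script path follow the Red Hat 7.2 layout used elsewhere in this tutorial; the script name and log path are examples.

  #!/bin/sh
  # /usr/local/sbin/nfs-watchdog.sh -- restart nfsd if it is not running.
  # Run from cron, e.g.:  */5 * * * * root /usr/local/sbin/nfs-watchdog.sh

  if ! ps ax | grep -v grep | grep -q nfsd ; then
      echo "`date`: nfsd not running, restarting" >> /var/log/nfs-watchdog.log
      /etc/rc.d/init.d/nfs restart
  fi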


High Availability

• One of the key points for clusters
  – Especially needed for commercial applications
    • 7 days a week, 24 hours a day
  – Not necessarily very scalable (~32 nodes)

• Based on many issues already described
  – Single-system image
    • Hides any possible change in the configuration
  – Monitoring tools
    • Detect the errors so that they can be corrected
  – Process migration
    • Restart/continue applications on the surviving nodes


Design and Build a Cluster Computer


Cluster Design

• Clusters are good as personal supercomputers

• Out of the box, clusters are often not good as general-purpose multi-user production machines

• Building such a cluster requires planning and understanding design tradeoffs


Scalable Cluster Design Principles

• Principle of Independence
• Principle of Balanced Design
• Principle of Design for Scalability
• Principle of Latency Hiding


Principle of independence

• Components (hardware & software) of the system should be independent of one another

• Incremental scaling – scaling up the system along one dimension by improving one component, independently of the others

• For example, upgrading the processor to the next generation should give higher performance without upgrading the other components

• Should enable heterogeneity scalability


Principle of independence

• Component independence can result in cost cutting

• The component becomes a commodity, with the following features
  – Open architecture with standard interfaces to the rest of the system
  – Off-the-shelf product; public domain
  – Multiple vendors in the open market with large volume
  – Relatively mature
  – For all these reasons, the commodity component has low cost, high availability and reliability


Principle of independence

• Independence principle and application examples
  – The algorithm should be independent of the architecture
  – The application should be independent of the platform
  – The programming language should be independent of the machine
  – The language should be modular and have orthogonal features
  – The node should be independent of the network, and the network interface should be independent of the network topology

• Caveat
  – In any parallel system, there is usually some key component/technique that is novel
  – We cannot build an efficient system by simply scaling up one or a few components

• Design should be balanced


Principle of Balanced Design

• Minimize any performance bottleneck
• Avoid an unbalanced system design, where a slow component degrades the performance of the entire system
• Avoid single points of failure

• Example – the PetaFLOPS project: the memory requirement for a wide range of scientific/engineering applications
  – Memory (GB) ≈ Speed^(3/4) (Gflop/s)
  – For 1 Pflop/s = 10^6 Gflop/s this gives (10^6)^(3/4) ≈ 3 × 10^4 GB, so about 30 TB of memory is appropriate for a Pflop/s machine


Principle of Design for Scalability

• Provision must be made so that the system can either scale up to provide higher performance
• Or scale down to allow affordability or greater cost-effectiveness

• Two approaches
  – Overdesign
    • Example – modern processors support a 64-bit address space. This huge address space may not be fully utilized by a Unix that supports only 32-bit addresses, but the overdesign makes the transition of the OS from 32-bit to 64-bit much easier
  – Backward compatibility
    • Example – a parallel program designed to run on n nodes should also be able to run on a single node, possibly with a reduced input data set


Principle of Latency Hiding

• Future scalable systems are most likely to use a distributed shared-memory architecture

• Access to remote memory may experience long latencies
  – Example – GRID

• Scalable multiprocessor clusters must therefore rely on
  – Latency hiding
  – Latency avoidance
  – Latency reduction


Cluster Building

• Conventional wisdom: building a cluster is easy
  – Recipe:
    • Buy hardware from a computer shop
    • Install Linux, connect the machines via a network
    • Configure NFS, NIS
    • Install your application, run and be happy

• Building it right is a little more difficult
  – Multi-user cluster, security, performance tools
  – Basic question – what works reliably?

• Building it to be compatible with the Grid
  – Compilers, libraries
  – Accounts, file storage, reproducibility

• Hardware configuration may be an issue


How do people think of parallel programming and using clusters …..


Panther Cluster

• Picked 8 PCs and named them after the panther family

• Connected them by a network and set up this cluster in a small lab

• Using the Panther cluster
  – Select a PC & log in
  – Edit and compile the code
  – Execute the program
  – Analyze the results

[Figure: eight networked PCs named Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion and Cat]


Panther Cluster - Programming

• Explicit parallel programming isn't easy – you really have to do it yourself

• Network bandwidth and latency matter

• There are good reasons for security patches

• Oops… Lion does not have floating point



Panther Cluster Attacks Users

• Grad students wanted to use the cool cluster; each needs only half of it (a half other than Lion)

• Grad students discover that using the same PC at the same time is incredibly bad

• A solution would be to use parts of the cluster exclusively, for one job at a time

• And so…



We Discover Scheduling

• We tried
  – A sign-up sheet
  – Yelling across the yard
  – A mailing list
  – ‘finger schedule’
  – …
  – A scheduler

[Figure: the eight-PC Panther cluster with a queue of jobs (Job 1, Job 2, Job 3) waiting to be scheduled]


Panther Expands

• Panther expands, adding more users and more systems

• Use the Panther node for
  – Login
  – File services
  – Scheduling services
  – …

• All other nodes are compute nodes

[Figure: the Panther node provides the services; the original animal nodes plus new nodes PC1–PC9 act as compute nodes]


Evolution of Cluster Services

[Figure: as the cluster grows, the services that initially share one node – login, file service, scheduling, management and I/O services – are moved onto dedicated nodes. Basic goal: improve computing performance, and improve system reliability and manageability.]


Compute @ Panther

• Usage model
  – Log in to the “login” node
  – Compile and test code
  – Schedule a test run
  – Schedule a serious run
  – Carry out I/O through the I/O node

• Management model
  – The compute nodes are identical
  – Users use the login, I/O and compute nodes
  – All I/O requests are managed by the metadata server

[Figure: the Panther cluster with dedicated Login, File, Scheduling, Management and I/O nodes in front of the compute nodes]


Cluster Building Block


Building Blocks - Hardware

• Processor
  – Complex Instruction Set Computer (CISC)
    • x86, Pentium Pro, Pentium II, III, IV
  – Reduced Instruction Set Computer (RISC)
    • SPARC, RS6000, PA-RISC, PowerPC
  – Explicitly Parallel Instruction Computing (EPIC)
    • IA-64 (Itanium, McKinley)


Building Blocks - Hardware

• Memory
  – Extended Data Out (EDO)
    • Pipelining by loading the next call to or from memory
    • 50 – 60 ns
  – DRAM and SDRAM
    • Dynamic and Synchronous RAM (no pairs)
    • ~13 ns
  – PC100 and PC133
    • 7 ns and less


Building Blocks - Hardware

• Cache
  – L1 – ~4 ns
  – L2 – ~5 ns
  – L3 (off the chip) – ~30 ns

• Celeron
  – 0 – 512 KB
• Intel Xeon chips
  – 512 KB – 2 MB L2 cache
• Intel Itanium
  – 512 KB –
• Most processors have at least 256 KB


Building Blocks - Hardware

• Disks and I/O
  – IDE and EIDE
    • e.g. IBM 75 GB 7200 rpm disk with 2 MB onboard cache
  – SCSI I, II, III and SCA
    • 5400, 7200, and 10000 rpm
    • 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
    • Can chain 6 – 15 disks
  – RAID sets
    • Software and hardware
    • Best for dealing with parallel I/O
    • Reserved cache before flushing to disks


Building Blocks - Hardware

• System bus
  – ISA
    • 5 MHz – 13 MHz
  – 32-bit PCI
    • 33 MHz
    • 133 MB/s
  – 64-bit PCI
    • 66 MHz
    • 266 MB/s


Building Blocks - Hardware

• Network Interface Cards (NICs)
  – Ethernet – 10 Mbps, 100 Mbps, 1 Gbps
  – ATM – 155 Mbps and higher
    • Quality of Service (QoS)
  – Scalable Coherent Interface (SCI)
    • ~12 microseconds latency
  – Myrinet – 1.28 Gbps
    • 120 MB/s
    • ~5 microseconds latency
  – PARAMNet – 2.5 Gbps


Building Blocks – Operating System

• Solaris – Sun
• AIX – IBM
• HP-UX – HP
• IRIX – SGI
• Linux – everyone!
  – Architecture independent
• Windows NT/2000


Building Blocks - Compilers

• Commercial
  – Portland Group Inc. (PGI)
    • C, C++, F77, F90
    • Not as expensive as vendor-specific compilers, and compile most applications

• GNU
  – gcc, g++, g77, VAST f90
  – Free!


Building Blocks - Scheduler

• cron, at (NT/2000)
• Condor
• IBM LoadLeveler
• LSF
• Portable Batch System (PBS)
• Maui Scheduler
• GLOBUS
• Most of them are free, and they run on more than one OS!


Building Blocks – Message Passing

• Commercial and free options
• Naturally parallel and highly parallel workloads
• Condor
  – High Throughput Computing (HTC)
• Parallel Virtual Machine (PVM)
  – Oak Ridge National Laboratory
• Message Passing Interface (MPI)
  – MPICH from Argonne National Laboratory (see the example below)


Building Blocks – Debugging and Analysis

• Parallel debuggers
  – TotalView
    • GUI based

• Performance analysis tools
  – Monitoring of library calls and runtime analysis
  – AIMS, MPE, Pablo
  – Paradyn – from Wisconsin
  – SvPablo, Vampir, Dimemas, Paraver


Building Block – Other

• Cluster Administration Tools

• Cluster Monitoring Tools

These tools are the part of Single System Image Aspects


Scalability of Parallel Processors

[Figure: performance vs. number of processors for an SMP, a cluster of SMPs, and a cluster of uniprocessors]


Installing the Operating System

• Which package ?

• Which Services ?

• Do I need a graphical environment ?
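One practical answer to "which services?": compute nodes usually need far fewer daemons than a desktop install enables. A hedged sketch using the Red Hat chkconfig tool; the service names in the loop are only examples of daemons a compute node typically does not need, so check your own list first.

  # Show which services start in the default runlevel
  /sbin/chkconfig --list | grep 3:on

  # Switch off services a compute node normally does not need
  # (example service names -- adapt to what is actually installed)
  for svc in sendmail lpd apmd isdn pcmcia ; do
      /sbin/chkconfig "$svc" off
  done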


Identifying the hardware bottlenecks

• Is my hardware optimal ?

• Can I improve my hardware choices ?

• How can I identify where is the problem ?

• Common hardware bottlenecks !!
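Before running full benchmarks, a few standard commands give a rough first idea of where a node stands; this is only a quick sanity check, not a substitute for the benchmarks listed on the next slide, and the host name star2 is taken from the example cluster.

  cat /proc/cpuinfo          # processor type, clock speed, cache size
  free -m                    # RAM and swap usage in MB
  vmstat 5                   # paging, I/O and CPU utilisation every 5 s (Ctrl-C to stop)
  ping -c 10 star2           # round-trip latency to another node

  # Rough sequential write bandwidth of the local disk (writes a 1 GB file)
  dd if=/dev/zero of=/tmp/ddtest bs=1024k count=1024 ; rm -f /tmp/ddtest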


Benchmarks

• Synthetic benchmarks
  – Bonnie (disk I/O)
  – STREAM (memory bandwidth)
  – NetPerf, NetPIPE (network)

• Application benchmarks
  – High Performance Linpack (HPL)
  – NAS Parallel Benchmarks


Networking under Linux


Network Terminology Overview

• IP address: the unique machine address on the net (e.g., 128.169.92.195)

• netmask: determines which portion of the IP address specifies the subnetwork number, and which portion specifies the host on that subnet (e.g., 255.255.255.0)

• network address: the IP address bitwise-ANDed with the netmask (e.g., 128.169.92.0)

• broadcast address: network address ORed with the negation of the netmask (128.169.92.255)
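A small worked example of these two definitions in shell, computing the network and broadcast addresses for the IP and netmask quoted above; it is only an illustration, any IP calculator gives the same result.

  #!/bin/sh
  # Compute the network and broadcast addresses octet by octet.
  IP=128.169.92.195
  MASK=255.255.255.0

  oldIFS=$IFS; IFS=.
  set -- $IP;   i1=$1; i2=$2; i3=$3; i4=$4
  set -- $MASK; m1=$1; m2=$2; m3=$3; m4=$4
  IFS=$oldIFS

  NET=$(( i1 & m1 )).$(( i2 & m2 )).$(( i3 & m3 )).$(( i4 & m4 ))
  BCAST=$(( i1 | (255-m1) )).$(( i2 | (255-m2) )).$(( i3 | (255-m3) )).$(( i4 | (255-m4) ))

  echo "network   address = $NET"      # 128.169.92.0
  echo "broadcast address = $BCAST"    # 128.169.92.255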


Network Terminology Overview

• gateway address: the address of the gateway machine that lives on two different networks and routes packets between them

• name server address: the address of the name server that translates host names into IP addresses


A Cluster Network

[Figure: the server galaxy (192.168.1.100) and the clients star1 (192.168.1.1), star2 (192.168.1.2) and star3 (192.168.1.3) connected through a switch. The private IPs allow RSH among the nodes; the server's outside IP accepts SSH only. NIS domain name = workshop]


Network Configuration

IP Address

• Three private IP address ranges –
  – 10.0.0.0 to 10.255.255.255; 172.16.0.0 to 172.31.255.255; 192.168.0.0 to 192.168.255.255
  – Information on private intranets is available in RFC 1918

• Warning: do not use the network addresses 10.0.0.0, 172.16.0.0 or 192.168.0.0 themselves for the server

• Netmask – 255.255.255.0 should be sufficient for most clusters


Network Configuration

DHCP: Dynamic Host Configuration Protocol

• Advantages
  – Simplifies network setup

• Disadvantages
  – It is a centralized solution (is it scalable?)
  – IP addresses are tied to Ethernet addresses, which can be a problem if you change the NIC or want to change hostnames routinely


Network Configuration Files

• /etc/resolv.conf – configures the name resolver, specifying the following fields:
  – search (a list of alternate domain names to search for a hostname)
  – nameserver (IP addresses of the DNS servers used for name resolution)

  search cse.iitd.ac.in
  nameserver 128.169.93.2
  nameserver 128.169.201.2


Network Configuration Files

• /etc/hosts – contains a list of IP addresses and their corresponding hostnames. Used for faster name resolution (no need to query the domain name server to get the IP address):

  127.0.0.1       localhost localhost.localdomain
  128.169.92.195  galaxy galaxy.cse.iitd.ac.in
  192.168.1.100   galaxy
  192.168.1.1     star1

• /etc/host.conf – specifies the order of queries used to resolve host names. Example:

  order hosts, bind   # check /etc/hosts first and then the DNS
  multi on            # allow a host to have multiple IP addresses


Host-specific Configuration Files

• /etc/conf.modules – specifies the list of modules (drivers) that have to be loaded by kerneld (see /lib/modules for a full list)
  – alias eth0 tulip

• /etc/HOSTNAME – specifies your system hostname: galaxy1.cse.iitd.ac.in

• /etc/sysconfig/network – specifies the gateway host and gateway device:

  NETWORKING=yes
  HOSTNAME=galaxy.cse.iitd.ac.in
  GATEWAY=128.169.92.1
  GATEWAYDEV=eth0
  NISDOMAIN=workshop


Configure Ethernet Interface

• Loadable Ethernet drivers
  – Loadable modules are pieces of object code that can be loaded into a running kernel, allowing Linux to add device drivers to a running system in real time. The loadable Ethernet drivers live in the /lib/modules/release/net directory
  – /sbin/insmod tulip
  – GET A COMPATIBLE NIC!!!

• The ifconfig command assigns TCP/IP configuration values to network interfaces
  – ifconfig eth0 128.169.95.112 netmask 255.255.0.0 broadcast 128.169.255.255
  – ifconfig eth0 down

• To set the default gateway
  – route add default gw 128.169.92.1 eth0

• GUI: system -> control panel

• To make the settings permanent, put them in the interface configuration file (see the sketch below)
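A hedged sketch of a persistent interface configuration in the Red Hat style assumed by this tutorial (the file /etc/sysconfig/network-scripts/ifcfg-eth0); the address values simply repeat the example above and must be adapted to your own network.

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  DEVICE=eth0
  BOOTPROTO=static
  IPADDR=128.169.95.112
  NETMASK=255.255.0.0
  BROADCAST=128.169.255.255
  ONBOOT=yes

  # Apply without rebooting:
  #   /etc/rc.d/init.d/network restart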


Troubleshooting

• Some useful commands:
  – lsmod – shows information about all loaded modules
  – insmod – installs a loadable module in the running kernel (e.g., insmod tulip)
  – rmmod – unloads loadable modules from the running kernel (e.g., rmmod tulip)
  – ifconfig – sets up and maintains the kernel-resident network interfaces (e.g., ifconfig eth0, ifconfig eth0 up)
  – route – shows / manipulates the IP routing table


Network File System (NFS)


NFS

• NFS has a number of characteristics:
  – makes sharing of files over a network possible (and transparent to the users)
  – allows keeping data files that consume large amounts of disk space on a single server (/usr/local, etc.)
  – allows keeping all user accounts on one host (/home)
  – introduces security risks
  – slows down performance

• How NFS works:
  – a server specifies which directory can be mounted from which host (in the /etc/exports file)
  – a client mounts a directory from a remote host on a local directory (the mountd mount daemon on the server verifies the permissions)
  – when someone accesses a file over NFS, the kernel places an RPC (Remote Procedure Call) to the NFS daemon (nfsd) on the server


NFS Configuration

• On both client and server:

• Make sure your kernel has NFS support compiled by running

– cat /proc/filesystems

• The following daemons must be running on each host (check by running ps aux or run setup -> System Services, switch on the daemons, and reboot):

– /etc/rc.d/init.d/portmap (or rpc.portmap)

– /usr/sbin/rpc.mountd

– /usr/sbin/rpc.nfsd

• Check that mountd and nfsd are running properly by running /usr/sbin/rpcinfo -p

• If you experience problems, try to restart portmap, mountd, and nfsd daemon in sequence


NFS Server Setup

• On an NFS server:
  – Edit the /etc/exports file to specify the directory to be mounted, the host(s) allowed to mount it, and the permissions (e.g., /home client.cs.utk.edu(rw) gives the host client.cs.utk.edu read/write access to /home)
  – Make this file world readable:
    • chmod a+r /etc/exports
  – Run /usr/sbin/exportfs so that mountd and nfsd re-read /etc/exports (or run /etc/rc.d/init.d/nfs restart); a short verification sketch follows below

• Example /etc/exports (note: no space between the host and its options):

  /home        star1(rw)
  /usr/export  star2(ro)
  /home        galaxy10.xx.iitd.ac.in(rw)
  /mnt/cdrom   galaxy13.xx.iitd.ac.in(ro)
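A short, hedged verification sequence to run after editing /etc/exports; the paths are the standard nfs-utils locations on a Red Hat system.

  # Re-export everything listed in /etc/exports
  /usr/sbin/exportfs -a

  # Show what this server is currently exporting, and to whom
  /usr/sbin/showmount -e localhost

  # Confirm that mountd and nfsd are registered with the portmapper
  /usr/sbin/rpcinfo -p localhost | grep -E 'mountd|nfs'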


NFS Client Setup

• On an NFS client:
  – To mount a remote directory use the command (it is assumed that local_dir exists):

    mount -t nfs remote_host:remote_dir local_dir

    Example:  mount -t nfs galaxy1:/usr/local /opt

  – To make the system mount an NFS file system upon boot, add a line to the /etc/fstab file:

    server_host:dir_on_server  local_dir  nfs  rw,auto  0 0

    Example:  galaxy1:/home  /home  nfs  defaults  0 0

  – To unmount the file system:  umount /local_dir

• Note: check with the df command (or cat /etc/mtab) to see mounted/unmounted directories.


NFS: Security Issues

• To forbid suid programs to work off the NFS file system, specify the nosuid option in /etc/fstab

• To deny the root user on the client access to files that only root on the server can access or change, specify the root_squash option in /etc/exports; to grant the client root access to a filesystem use no_root_squash

• To prevent unprivileged access to NFS files on a server, specify portmap: ALL in /etc/hosts.deny


Network Information Service (NIS)


NIS

• NIS, Network Information Service, is a service that provides information (e.g., login names/passwords, home directories, group information), that has to be known throughout the network. This information can be authentication, such as passwd and group files, or it can be informational, such as hosts files.

• NIS was formerly called Yellow Pages (YP), thus yp is used as a prefix on most NIS-related commands

• NIS is based on RPC (remote procedure call)

• NIS keeps database information in maps (e.g., hosts.byname, hosts.byaddr) located in /var/yp/nis.domain/

• The NIS domain is a collection of all hosts that share part of their system configuration data through NIS


• Within an NIS domain there must be at least one machine acting as a NIS server. The NIS server maintains the NIS database - a collection of DBM files generated from the /etc/passwd, /etc/group, /etc/hosts and other files. If a user has his/her password entry in the NIS password database, he/she will be able to login on all the machines on the network which have the NIS client programs running.

• When there is a NIS request, the NIS client makes an RPC call across the network to the NIS server it was configured for. To make such a call, the portmap daemon should be running on both client and server.

• You may have several so-called slave NIS servers that hold copies of the NIS database from the master NIS server. Having NIS slave servers is useful when there are a lot of users on the server or in case the master server goes down. In that case, an NIS client tries to connect to the fastest active server.

How NIS Works


Authentication Order

• /etc/nsswitch.conf (server)

  passwd:  files nis
  group:   files nis
  shadow:  files nis
  hosts:   files nis dns

• The /etc/nsswitch.conf file is used by the C library (glibc) to determine in what order to query for information.

• The syntax is: service: <method> <method> …

• Each method is queried in order until one returns the requested information. With the client configuration shown below, the NIS server is queried first for password information and, if that fails, the local /etc/passwd file is checked. Once all methods are queried, if no answer is found, an error is returned.

• /etc/nsswitch.conf (client)

  passwd:  nis files
  group:   nis files
  shadow:  nis files
  hosts:   nis files dns


NIS Client Setup

• Set the name of your NIS domain: /sbin/domainname nis.domain
• The ypbind daemon (under /etc/rc.d/init.d, switch it on from the setup command) handles the client's requests for NIS information
• ypbind needs a configuration file, /etc/yp.conf, that tells the client how to behave (broadcast the request or send it directly to a server)
• The /etc/yp.conf file can use three different options:
  – domain NISDOMAIN server HOSTNAME
  – domain NISDOMAIN broadcast (default)
  – ypserver HOSTNAME
• Create a directory /var/yp (if one does not exist)
• Start up /etc/rc.d/init.d/portmap
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypbind to check whether ypbind was able to register its service with the portmapper
• Specify the NISDOMAIN in /etc/sysconfig/network:
  – NISDOMAIN=workshop

• A consolidated sketch of these steps is given below.
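A hedged, consolidated sketch of the client-side steps above for the "workshop" domain of the example cluster; the server name (galaxy) and all paths follow the examples used in this tutorial.

  # Set the NIS domain for this boot and make it permanent
  /sbin/domainname workshop
  grep -q NISDOMAIN /etc/sysconfig/network || echo "NISDOMAIN=workshop" >> /etc/sysconfig/network

  # Point the client at the NIS server
  echo "domain workshop server galaxy" > /etc/yp.conf

  # Start the required daemons
  /etc/rc.d/init.d/portmap start
  /etc/rc.d/init.d/ypbind start

  # Verify that the client can reach the NIS maps
  /usr/sbin/rpcinfo -u localhost ypbind
  ypcat passwd | head -3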


NIS Server Setup

• /var/yp contains one directory for each domain that the server can respond to. It also has the Makefile used to update the data base maps.

• Generate the NIS database: /usr/lib/yp/ypinit -m

• To update a map (i.e. after creating a new user account), run make in the /var/yp directory

• Makefile controls which database maps are built for distribution and how they are built.

• Make sure that the following daemons are running:

– portmap, ypserv, yppasswd

• You may switch these daemons on from the setup->System Services. They are located under /etc/rc.d/init.d.

• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypserv to check if ypserv was able to register its service with the portmapper.

• Verify that the NIS services work properly, run ypcat passwd or ypmatch userid passwd
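And the corresponding hedged sketch for the server side (run on galaxy); it assumes the NIS packages (ypserv, yp-tools) are installed in the standard Red Hat locations.

  # Make sure the domain is set before building the maps
  /sbin/domainname workshop

  # Start the daemons the server needs
  /etc/rc.d/init.d/portmap start
  /etc/rc.d/init.d/ypserv start
  /etc/rc.d/init.d/yppasswdd start

  # Build the NIS maps from /etc/passwd, /etc/group, /etc/hosts, ...
  /usr/lib/yp/ypinit -m

  # After adding a user later, rebuild the maps
  cd /var/yp && make

  # Sanity checks
  /usr/sbin/rpcinfo -u localhost ypserv
  ypmatch root passwd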


Working Cluster

[Figure: the same cluster network as before, now fully configured – galaxy (192.168.1.100) serving star1–star3 (192.168.1.1–3) over the switch; private IPs allow RSH, the outside IP accepts SSH only; NIS domain name = workshop]


Kernel ??


Kernel Concepts

• The kernel acts as a mediator between your programs and the hardware:
  – Manages the memory for all running processes
  – Makes sure that the processes share CPU cycles appropriately
  – Provides an interface for the programs to the hardware

• Check the config.in file (normally under /usr/src/linux/arch/i386) to find out what kind of hardware the kernel supports.

• You can enable kernel support for your hardware or programs (network, NFS, modem, sound, video, etc.) in two ways:
  – by linking the pieces of kernel code directly into the kernel (they are “burnt into” the kernel, which increases its size): monolithic kernel
  – by loading those pieces as modules (drivers) on request, which is more flexible and usually preferable: modular kernel

• Why upgrade?
  – To support more/newer types of hardware
  – Improved process management
  – Faster, more stable, bug fixes
  – More secure


Why Use Kernel Modules?

• Easier to test: no need to reboot the system to load and unload a driver

• Less memory usage: modular kernels are smaller than monolithic kernels. Memory used by the kernel is never swapped out, so unused drivers compiled into your kernel waste RAM.

• One single boot image: no need to build different boot images for different hardware

• If you have a diskless machine you cannot use a fully modular kernel, since the modules (drivers) are normally stored on the hard disk.

• The following drivers cannot be modularized:
  – the driver for the hard disk where your root filesystem resides
  – the root filesystem driver
  – the binary format loader for init, kerneld and other programs


Which module to load?

• The kerneld daemon allows kernel modules (device drivers, network drivers, filesystems) to be loaded automatically when they are needed, rather than doing it manually via insmod or modprobe.

• When there is a request for a certain device, service, or protocol, the kernel sends this request to the kerneld daemon. The daemon determines what module should be loaded by scanning the configuration file /etc/conf.modules.

• You may create the /etc/conf.modules file by running:– /sbin/modprobe -c | grep -v ‘^path’ > /etc/conf.modules

• /sbin/modprobe -c : get a listing of the modules that the kerneld knows about

• /sbin/lsmod: lists all currently loaded modules

• /sbin/insmod: installs a loadable module in the running kernel

• /sbin/rmmod: unload loadable modules


Configuration and Installation Steps

• 1. Check the /usr/src/linux directory on the system. If it is empty, get the kernel source (e.g., from ftp.kernel.org/pub/linux/kernel) and unpack it:
  – $ cd kernel_src_tree
  – $ tar zpxf linux-version.tar.gz

• 2. Configure the kernel: ‘make config’ or ‘make xconfig’

• 3. Check the dependencies: ‘make dep’

• 4. Clean stale object files: ‘make clean’ or ‘make mrproper’

• 5. Compile the kernel image: ‘make bzImage’ or ‘make zImage’

• 6. Install the kernel: ‘make bzlilo’, or re-install LILO by:
  – making a backup of the old kernel /vmlinuz
  – copying /usr/src/linux/arch/i386/boot/bzImage to /vmlinuz
  – running lilo: $ /sbin/lilo

• 7. Compile the kernel modules: ‘make modules’

• 8. Install the kernel modules: ‘make modules_install’ (a consolidated sketch follows below)
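The same steps as one hedged command sequence for a 2.4-era kernel booted with LILO, as assumed throughout this tutorial; the version number in the tarball name is only an example, and the boot-loader steps must be adapted if you use something other than LILO.

  cd /usr/src
  tar zpxf linux-2.4.18.tar.gz          # unpack the source (version is an example)
  cd linux

  make xconfig                          # choose drivers and options
  make dep && make clean                # dependencies, remove stale objects
  make bzImage                          # build the kernel image
  make modules && make modules_install  # build and install the modules

  cp /vmlinuz /vmlinuz.old                    # keep the old kernel
  cp arch/i386/boot/bzImage /vmlinuz          # install the new image
  /sbin/lilo                                  # update the boot loader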


PBS: Portable Batch System

(www.OpenPBS.org)

• Why PBS?
• PBS Architecture
• Installation and Configuration
• Server Configuration
• Configuration Files
• PBS Daemons
• PBS User Commands
• PBS System Commands


PBS

• Developed by Veridian MRJ for NASA
• Compliant with the POSIX 1003.2d Batch Environment standard
• Supports nearly all UNIX-like platforms (Linux, SP2, Origin2000)

This figure has been taken from http://www.extremelinux.org/activities/usenix99/docs/pbs/sld007.htm


PBS Components

• PBS consists of four major components:
  – commands – used to submit, monitor, modify, and delete jobs
  – job server (pbs_server) – provides the basic batch services (receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job). All commands and the other daemons communicate with the server.
  – job executor, or MOM, “mother of all the processes” (pbs_mom) – the daemon that places the job into execution and later returns the job’s output to the user
  – job scheduler (pbs_sched) – controls which job is run, and where and when it is run, based on the implemented policies. It also communicates with the MOMs to get information about system resources and with the server daemon to learn about the availability of jobs to execute.


PBS Installation and Configuration

• Package configuration: run the configure script
• Package installation:
  – Compile the PBS modules: “make”
  – Install the PBS modules: “make install”

• Server configuration:
  – Create a node description file: $PBS_HOME/server_priv/nodes
  – One time only: “pbs_server -t create”
  – Configure the server by running “qmgr” (see the sketch below)

• Scheduler configuration:
  – Edit the $PBS_HOME/sched_priv/sched_config file defining the scheduling policies (first-in-first-out / round-robin / load balancing)
  – Start the scheduler: “pbs_sched”

• Client configuration:
  – Edit the $PBS_HOME/mom_priv/config file: MOM’s config file
  – Start the execution server: “pbs_mom”


Server & Client Configuration Files

• To configure the server, edit the file $PBS_HOME/server_priv/nodes:

  # nodes that this server will run jobs on
  # node      properties
  galaxy10    even
  galaxy11    odd
  galaxy12    even
  galaxy13    odd

• To configure the MOM, edit the file $PBS_HOME/mom_priv/config:

  $clienthost galaxy10
  $clienthost galaxy11
  $clienthost galaxy12
  $clienthost galaxy13
  $restricted *.cs.iitd.ac.in

• See the PBS Administrator’s Guide for details.


PBS Daemons

• batch server (pbs_server)
• MOM (pbs_mom)
• scheduler (pbs_sched)

• To autostart the PBS daemons at boot time, add the appropriate lines to the file /etc/rc.d/rc.local:
  – /usr/local/pbs/sbin/pbs_server   # on the PBS server
  – /usr/local/pbs/sbin/pbs_sched    # on the PBS server
  – /usr/local/pbs/sbin/pbs_mom      # on each PBS MOM


PBS User Commands

• qsub job_script: submit job_script file to PBS

• qsub -I : submit an interactive-batch job

• qstat -a : list all jobs on the system. It provides

the following job information:

• the job IDs for each job

• the name of the submission script

• the number of nodes required/used by each job

• the queue that the job was submitted to.

• qdel job_id : delete a PBS job

• qhold job_id : put a job on hold


PBS System Commands

• Note: Only the manager can use these commands:

• qmgr: pbs batch system manager

• qrun : used to force a batch server to initiate the execution of a batch job. The job is run regardless of scheduling position, resource requirements, or state

• pbsnodes : pbs node manipulation

• pbsnodes -a : lists all nodes with their attributes
• xpbs : graphical user interface to the PBS commands

• xpbs -admin : Running xpbs in administrative mode

• xpbsmon - GUI for displaying, monitoring the nodes/execution hosts under PBS


Typical Usage

$ qsub -lnodes=4 -lwalltime=1:00:00 script.sh

where script.sh may look like this:

  #!/bin/sh
  mpirun -np 4 matmul

• Sample script that can be used for MPICH jobs:

  #!/bin/bash
  #PBS -j oe
  #PBS -l nodes=2,walltime=01:00:00
  cd $PBS_O_WORKDIR
  /usr/local/mpich/bin/mpirun \
      -np `cat $PBS_NODEFILE | wc -l` \
      -leave_pg \
      -machinefile $PBS_NODEFILE \
      $PBS_O_WORKDIR/matmul


Installation of PVFS: Parallel Virtual File System


PVFS - Objectives

• To meet the need for a parallel file system for Linux clusters
• Provide high bandwidth for concurrent read/write operations from multiple processes or threads to a common file
• No change in the working of common UNIX shell commands like ls, cp, rm
• Support for multiple APIs: a native PVFS API, the UNIX/POSIX API, the MPI-IO API
• Scalable


PVFS - Features

• Cluster-wide consistent name space
• User-controlled striping of data across disks on different I/O nodes
• Existing binaries operate on PVFS files without the need for recompiling
• User-level implementation: no kernel modifications are necessary for it to function properly


Design Features

• Manager Daemon with Metadata

• I/O Daemon

• Client Library


Metadata

• Metadata: information describing the characteristics of a file, e.g. the owner and group, the permissions, and the physical distribution of the file


Installing the Metadata Server

[root@head /]# mkdir /pvfs-meta
[root@head /]# cd /pvfs-meta
[root@head /pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files
in the metadata directory of a PVFS file system.

Enter the root directory: /pvfs-meta
Enter the user id of directory: root
Enter the group id of directory: root
Enter the mode of the root directory: 777
Enter the hostname that will run the manager:


Installing the Metadata Server …continued

head
Searching for host...success
Enter the port number on the host for manager: (Port number 3000 is the default) 3000
Enter the I/O nodes: (can use form node1, node2, ... or nodename#-#,#,#) node0-7
Searching for hosts...success
I/O nodes: node0 node1 node2 node3 node4 node5 node6 node7
Enter the port number for the iods: (Port number 7000 is the default) 7000
Done!


The Manager

• Applications communicate directly with the PVFS manager (via TCP) when opening / creating / closing / removing files

• The manager returns the location of the I/O nodes on which the file data is stored

• Issue: presentation of the directory hierarchy of PVFS files to application processes
  1. NFS. Drawbacks: NFS has to be mounted on all nodes in the cluster, and the default caching of NFS caused problems with metadata operations
  2. System calls related to directory access: a mapping routine determines directory access, and the operations are redirected to the PVFS manager


I/O Daemon

• Multiple servers in a client-server system, running on separate I/O nodes in the cluster

• PVFS files are striped across the disks on the I/O nodes

• Handles all file I/O


Configuring the I/O Servers

• On each of the machines that will act as I/O servers:

• Create a directory that will be used to store PVFS file data, and set the appropriate permissions on it:

  [root@node0 /]# mkdir /pvfs-data
  [root@node0 /]# chmod 700 /pvfs-data
  [root@node0 /]# chown nobody.nobody /pvfs-data

• Create a configuration file for the I/O daemon on each of your I/O servers:

  [root@node0 /]# cp /usr/src/pvfs/system/iod.conf /etc/iod.conf
  [root@node0 /]# ls -al /etc/iod.conf
  -rwxr-xr-x   1 root   root   57 Dec 17 11:22 /etc/iod.conf


Starting the PVFS daemons

• The PVFS manager handles access control to files on PVFS file systems:

  [root@head /root]# /usr/local/sbin/mgr

• The I/O daemon handles the actual file reads and writes for PVFS:

  [root@node0 /root]# /usr/local/sbin/iod

• To verify that one of the iods or the mgr is running correctly, use the iod-ping or mgr-ping utilities, respectively, e.g.:

  [root@head /root]# /usr/local/bin/iod-ping -h node0
  node0:7000 is responding.

• A hedged sketch of the client-side setup follows below.
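What remains is to make the file system visible on the client nodes. A hedged sketch, assuming a PVFS 1.x build that ships the pvfs kernel module, the pvfsd client daemon and the mount.pvfs helper in the usual install locations; the binary names, paths and mount point are assumptions, so consult the PVFS user guide for your version.

  # On each client node (assumed paths -- adapt to your installation)
  /sbin/insmod pvfs                            # load the PVFS kernel module
  /usr/local/sbin/pvfsd                        # start the client daemon
  mkdir -p /mnt/pvfs
  /sbin/mount.pvfs head:/pvfs-meta /mnt/pvfs   # mount the file system

  # Quick test: ordinary commands work on PVFS files
  cp /etc/hosts /mnt/pvfs/
  ls -l /mnt/pvfs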


Client Library

• Libpvfs : Links clients to PVFS servers

• Hides details of PVFS access from application tasks

• Provides “partitioned-file interface” : noncontiguous file regions can be accessed with a single function call


Client Library (contd.)

• User specification of the file partition using a special ioctl call

• Parameters
  – offset : how far into the file the partition begins, relative to the first byte of the file
  – gsize : size of the simple strided regions to be accessed
  – stride : distance between the start of two consecutive regions


Trapping Unix I/O Calls

• Applications call functions through the C library

• PVFS wrappers determine the type of file on which the operation is to be performed

• If the file is a PVFS file, the PVFS I/O library is used to handle the function, else parameters are passed to the actual kernel

• Limitation : Restricted portability to new architectures and operating system – a new module that can be mounted like NFS enables traversal and accessibility of the existing binaries


We have built a Standard Beowulf

[Figure: a standard Beowulf – compute nodes and PVFS joined by the interconnection network; what is still missing is the middleware for a Single System Image]


Linux Clustering Today

[Figure: the Linux clustering landscape plotted along availability and scalability axes – from commodity SMP/UP nodes and large SMP/NUMA/HPC machines (single OS, single system image, but with scalability and manageability issues) to multiple-OS distributed computing (availability issues) and SSI clusters such as LifeKeeper, Sun Clusters, ServiceGuard, plus OPS/XPS (for databases only). Design criteria: superior scalability, high availability, single-system manageability, best price/performance.]

Source – Bruce Walker: www.opensource.compaq.com


What is Single System Image (SSI)?

Co-operating OS Kernels providing transparent access to all OS resources cluster-wide, using a single namespace


Benefits of Single System Image

• Transparent use of system resources

• Improved reliability and higher availability

• Simplified system management

• Reduction in the risk of operator errors

• User need not be aware of the underlying system architecture to use these machines effectively


SSI vs. Scalability


Components of SSI Services

• Single Entry Point
• Single File Hierarchy: xFS, AFS, Solaris MC Proxy
• Single I/O Space – PFS, PVFS, GPFS
• Single Control Point: management from a single GUI
• Single virtual networking
• Single memory space – DSM
• Single Process Management: GLUnix, Condor, LSF
• Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may also use Web technology


Key Challenges

• Technical and Applications
  – Development Environment
    • Compilers, Debuggers
    • Performance Tools
  – Storage Performance
    • Scalable Storage
    • Common Filesystem
  – Admin Tools
    • Scalable Monitoring Tools
    • Parallel Process Control
  – Node Size
    • Resource Contention
    • Shared Memory Apps
  – Few Users => Many Users
    • 100 Users/month
  – Heterogeneous Systems
    • New generations of systems
  – Integration with the Grid
• Organizational
  – Integration with Existing Infrastructure
    • Accounts, Accounting
    • Mass Storage
  – Acceptance by Community
    • Increasing quickly
    • Software environments


Tools for Installation and Management


What have we learned so far ?

• Installing and maintaining a Linux cluster is a very difficult and time-consuming task

• We certainly need tools for installation and management

• The Open Source community has developed several software tools

• Let us evaluate them on installation, configuration, monitoring and management aspects by answering the following questions –


Installation

• General questions:
  – Can we use a standard Linux distribution, or is one included?
  – Can we use our favorite Linux distribution?
  – Are multiple configurations or profiles supported in the same cluster?
  – Can we use heterogeneous hardware in the same cluster?
    • e.g. nodes with different NICs
• Frontend installation:
  – How do we install the cluster software?
  – Do we need to install additional software? e.g. a Web server
  – Which services must be running and configured on the frontend?
    • DHCP, NFS, DNS etc.


Installation

• Node installation:
  – Can we install the nodes in an automatic way?
  – Is it necessary to enter information about the nodes manually?
    • MAC addresses, node names, hardware configuration etc.
  – Or is it collected automatically?
  – Can we boot a node from diskette or from the NIC?
  – Do we need a mouse, keyboard and monitor physically attached to the nodes?


Configuration

• Frontend configuration:
  – How do we configure the services on the frontend machine?
    • e.g. how do we configure DHCP, create a PBS node file, or create the /etc/hosts file?
• Node configuration:
  – How do we configure the OS of the nodes?
    • e.g. keyboard layout (English, Spanish etc.), disk partitioning, timezone etc.
  – How do we create and configure the initial users?
    • e.g. how do we generate ssh keys?
  – Can we modify the configuration of individual software packages?
  – Can we modify the configuration of the nodes once they are installed?


Monitoring

• Node monitoring:
  – Can we monitor the status of the individual cluster nodes?
    • e.g. load, amount of free memory, swap in use etc.
  – Can we monitor the status of the cluster as a whole?
  – Do we have a graphical tool to display this information?
• Detecting problems:
  – How do we know if the cluster is working properly?
  – How do we detect that there is a problem with one node?
    • e.g. the node has crashed, or its disk is full
  – Can we have alarms?
  – Can we define actions for certain events?
    • e.g. mail the system manager if a network interface is down


Maintenance

• General questions:
  – Is it easy to add or remove a node?
  – Is it easy to add or remove a software package?
  – When installing a new package, must it be in RPM format?
• Reinstallation:
  – How do we reinstall a node?
• Upgrading:
  – Can we upgrade the OS?
  – Can we upgrade individual software packages?
  – Can we upgrade the node hardware?
• Solving problems:
  – Do we have tools to solve problems remotely, or is it necessary to telnet to the nodes?
  – Can we start a node remotely?


NPACI Rocks


Who is NPACI Rocks?

• Cluster Computing Group at SDSC

• UC Berkeley Millennium Project
  – Provide Ganglia support
• Linux Competency Centre in SCS Enterprise Systems Pte Ltd in Singapore
  – Provide PVFS support
  – Working on user documentation for Rocks 2.2


NPACI Rocks

• NPACI Rocks is a full Linux distribution based on Red Hat Linux

• Designed and created for the installation, administration and use of Linux clusters

• NPACI Rocks is the result of a collaboration between several research groups and industry, led by the Cluster Development Group at the San Diego Supercomputer Center

• The NPACI Rocks distribution comes on three CD-ROMs (ISO images can be downloaded from the web)


NPACI Rocks – included software

Software packages for cluster management

• ClustAdmin: Tools for cluster administration. It includes programs like shoot-node to reinstall a node

• ClustConfig: Tools for cluster configuration. It includes software like an improved DHCP server

• eKV: keyboard and video emulation through Ethernet cards

• Rocks-dist: a utility to create the Linux distribution used for node installation. It includes a modified version of Red Hat kickstart with support for eKV

• Ganglia: a real-time cluster monitoring tool (with a web-based interface) and remote execution environment

• Cluster SQL: an SQL database schema that contains all node configuration information


NPACI Rocks – included software

Software packages for cluster Programming and Use

• PBS (Portable Batch System) and the Maui Scheduler
• A modified version of the secure network connectivity suite OpenSSH
• Parallel programming libraries MPI and PVM, with modifications to work with OpenSSH
• Support for Myrinet GM network interface cards


Basic Architecture

[Figure (source: Mason Katz et al., SDSC): front-end node(s) on the public Ethernet; cluster nodes connected through a Fast-Ethernet switching complex and a Gigabit network switching complex; power distribution (network-addressable units as an option)]


Major Components

Source: Mason Katz et al., SDSC

Standard Beowulf


Major Components

Source: Mason Katz et al., SDSC

NPACI Rocks


NPACI Rocks - Installation

• Installation procedure is based on the Red Hat kickstart installation program and a DHCP server

• Frontend installation
  – Create a configuration file for kickstart
  – The config file contains information about characteristics and configuration parameters
    • such as domain name, language, disk partitioning, root password etc.
  – Automatic installation – insert the Rocks CD and a floppy with the kickstart file, and reset the machine
  – When the installation completes, the CD is ejected and the machine reboots
  – At this point the frontend machine is completely installed and configured


NAPCI Rocks - Installation

• Node installation
  – Execute the insert-ether program on the frontend machine
  – This program gathers information about the nodes by capturing DHCP requests
  – It also configures the necessary services on the frontend
    • such as NIS maps, the MySQL database, the DHCP config file etc.
  – Insert the Rocks CD into the first node of the cluster and switch it on
  – If there is no CD drive, a floppy can be used
  – The node will be installed using kickstart without user intervention

Note: once the installation starts we can follow its progress from the frontend just by telneting to the node on port 8080.


NPACI Rocks - Configuration

• Rocks uses a MySQL database to store cluster-global and node-specific configuration parameters
  – e.g. node names, Ethernet addresses, IP addresses etc.

• The information for this database is collected automatically when the cluster nodes are installed for the first time

• The node configuration is done at installation time using Red Hat kickstart

• The kickstart configuration file for each node is created automatically by a CGI script on the frontend machine

• At installation time, the nodes request their kickstart files via HTTP

• The CGI script uses a set of XML-based files to construct a general kickstart file, then applies node-specific modifications by querying the MySQL database


NPACI Rocks – Monitoring

• Monitoring using Unix tools or the Ganglia cluster toolkit

Ganglia – real-time monitoring and remote execution
  – Monitoring is performed by a daemon, gmond
  – This daemon must be running on each node of the cluster
  – It collects information about more than twenty metrics
    • e.g. CPU load, free memory, free swap etc.
  – This information is collected and stored by the frontend machine
  – A web browser can be used to display it graphically

SNMP-based monitoring
  – The process status on any machine can be queried using the snmp-status script
  – Rocks can forward all the syslogs to the frontend machine


Ganglia

• Scalable cluster monitoring system
  – Based on IP multicast
  – Matt Massie et al., UCB
  – http://ganglia.sourceforge.net
• gmond daemon on every node
  – Multicasts system state
  – Listens to other daemons
  – All data is represented in XML
• Ganglia command line
  – Python code to parse the XML into English
• gmetric
  – Extends Ganglia
  – Command line to multicast single metrics


Ganglia

Source: Mason Katz et al., SDSC


NPACI Rocks - Maintenance

• The principal cluster maintenance operation is reinstallation

• To install, upgrade or remove software, or to change the OS, Rocks performs a complete reinstallation of all the cluster nodes to propagate the changes

• Example – installing a new piece of software:
  – Copy the package (must be in RPM format) into the Rocks software repository
  – Modify the XML configuration file
  – Execute the program rocks-dist to make a new Linux distribution
  – Reinstall all the nodes to apply the changes – use shoot-node


NPACI Rocks

• Rocks uses a NIS map to define the users of the cluster

• Rocks uses NFS for mounting the user home directories from frontend

• If we want to add or to remove a node of the cluster we can use the installation tool insert-ether


NPACI Rocks - Discussion

• Easy to install a cluster under Linux
• Does not require deep knowledge of cluster architecture or system administration to install a cluster

• Rocks provides a good solution for monitoring with Ganglia

• Rocks philosophy for cluster configuration and maintenance:
  – It is faster to reinstall all nodes to a known configuration than it is to determine whether nodes were out of synchronization in the first place


NPACI Rocks - Discussion

Other good points
• The kickstart installation program has an advantage over hard-disk image cloning – it results in good support for clusters made of heterogeneous hardware

• eKV makes a monitor and keyboard physically attached to the nodes almost unnecessary

• The node configuration parameters are stored in a MySQL database. This database can be queried to make reports about the cluster, or to program new monitoring and management scripts

• It is very easy to add or remove a user; we only have to modify the NIS map


NPACI Rocks - Discussion

Some bad points
• Rocks is an extended Red Hat Linux 7.2 distribution. This means it cannot work with other vendors' distributions, or even other versions of Red Hat Linux

• All software installed under Rocks must be in RPM format. If we have a tar.gz we have to create the corresponding RPM

• The monitoring program does not issue any alarm when something goes wrong in the cluster

• User homes are mounted by NFS, which is not a scalable solution

• PXE is not supported to boot nodes from network.


OSCAR

• Open Source Cluster Application Resources (OSCAR)

• Tools for installation, administration and use of Linux clusters

• Developed by the Open Cluster Group – an informal group of people whose objective is to make cluster computing practical for high performance computing

OSCAR


Where is OSCAR?

• Open Cluster Group site: http://www.OpenClusterGroup.org

• SourceForge development home: http://sourceforge.net/projects/oscar

• OSCAR site: http://www.csm.ornl.gov/oscar

• OSCAR email lists
  – Users: [email protected]
  – Announcements: oscar-[email protected]
  – Development: oscar-[email protected]


OSCAR – included software

• SIS (System Installation Suite) – a collection of software packages designed to automate the installation and configuration of networked workstations

• LUI (Linux Utility for cluster Installation) – a utility for installing Linux workstations remotely over an Ethernet network

• C3 (Cluster Command & Control tool) – a set of command-line tools for cluster management

• Ganglia – a real-time cluster monitoring tool and remote execution environment

• OPIUM – A password installer & user management tool

• Switcher – A tool for user environment configuration


OSCAR – included software

Software packages for cluster Programming and Use

• PBS (Portable Batch System) and Maui Scheduler

• A secure network connectivity suite OpenSSH

• Parallel programming libraries MPI (LAM/MPI and MPICH implementations) and PVM


OSCAR - Installation

• The installation of OSCAR must be done on an already working Linux machine.

• It is based on SIS and a DHCP server
• Frontend installation
  – The NIC must be configured
  – The hard disk must have at least 2 GB of free space on the partition holding the /tftpboot directory
  – Another 2 GB is needed for the /var directory
  – Copy all the RPMs that compose a standard Linux distribution into /tftpboot
  – These RPMs will be used to create the OS image needed during the installation of the cluster nodes
  – Execute the program install_cluster
  – This updates the server and configures it by modifying files like /etc/hosts or /etc/exports
  – It shows a graphical wizard that guides the installation procedure


OSCAR - Installation

• Node installation is done in three phases – cluster definition, node installation and cluster configuration
  – Cluster definition – build the image with the OS to install
    • Provide the list of software packages to install
    • Description of disk partitioning
    • Node name, IP address, netmask, subnet mask, default gateway etc.
    • Collect all the Ethernet addresses and match them to their corresponding IP addresses – this is the most time-consuming part, and the GUI helps us during this phase
  – Node installation – nodes are booted from the network and are automatically installed and configured
  – Cluster configuration – the frontend and node machines are configured to work together as a cluster


OSCAR - Monitoring

• OSCAR also uses the Ganglia cluster toolkit


OSCAR - Maintenance

• C3 (Cluster Command & Control tool suite)
• C3 is a set of text-oriented programs that are executed from the command line on the frontend and take effect on the cluster nodes
• The C3 tools include the following:
  – cexec – remote execution of commands
  – cget – file transfer to the frontend
  – cpush – file transfer from the frontend
  – ckill – similar to Unix kill but using process names instead of PIDs
  – cps – similar to Unix ps
  – cpushimage – reinstall a node
  – crm – similar to Unix rm
  – cshutdown – shut down an individual node or the whole cluster


OSCAR - Maintenance

• We can use the OSCAR installation wizard to add or remove a cluster node

• With SIS we can add or remove software packages

• With OPIUM and switcher packages, we can manage users and user environments.

• OSCAR has a program, start_over, that can be used to reinstall the complete cluster


OSCAR - Discussion

• OSCAR covers all aspects of cluster management
  – Installation
  – Configuration
  – Monitoring
  – Maintenance

• We can say OSCAR is a rather complete solution


OSCAR - Discussion

Other good points

• Good documentation

• The installation wizard is very helpful for installing the cluster

• OSCAR can be installed on any Linux distribution, although only Red Hat and Mandrake are supported at present


OSCAR - Discussion

Some bad points

• Though OSCAR has an installation wizard, it still requires an experienced system administrator

• Because the installation of OSCAR is based on images of the OS residing on the hard disk of the frontend, we have to create a different image for each kind of hardware
  – It is necessary to create one image for nodes with IDE disks and a separate one for nodes with SCSI disks

• It does not provide a way to follow the node installation without a physical keyboard and monitor connected

• The management tool C3 is rather basic


OSCAR - References

• LUI: http://oss.software.ibm.com/lui
• SIS: http://www.sisuite.org
• MPICH: http://www-unix.mcs.anl.gov/mpi/mpich
• LAM/MPI: http://www.lam-mpi.org
• PVM: http://www.csm.ornl.gov/pvm
• OpenPBS: http://www.openpbs.org
• OpenSSL: http://www.openssl.org
• OpenSSH: http://www.openssh.com
• C3: http://www.csm.ornl.gov/torc/C3
• SystemImager: http://systemimager.sourceforge.net


WP4 Project

EU DataGrid Fabric Management Project


EU DataGrid Fabric Management Project

• Work Package Four (WP4) of the DataGrid project is responsible for fabric management for the next generation of computing infrastructure

• The goal of WP4 is to develop a new set of tools and techniques for building very large computing fabrics (up to tens of thousands of processors) with reduced system administration and operating costs

• WP4 has developed its software for DataGrid, but it can be used with general-purpose clusters as well


WP4 project – Included Software

• Installation – automatic installation and maintenance

• Configuration – tools for configuration information gathering, database to store configuration information, and protocols for configuration management and distribution

• Monitoring – Software for gathering, storing and analyzing information (performance, status, environment) about nodes

• Fault tolerance – Tools for problem identification and solution


WP4 project - Installation

• Software installation and configuration is done with the aid of LCFG (Local ConFiGuration system)

• LCFG is a set of tools to:
  – Install, configure, upgrade and uninstall software for the OS and applications
  – Monitor the status of the software
  – Perform version management
  – Manage application installation dependencies
  – Support policy-based upgrades and automated scheduled upgrades


WP4 Project - Installation

• Frontend installation
  – Install the LCFG installation server software
  – Create a software repository directory (with at least 6 GB of free space) that will contain all the RPM packages necessary to install the cluster nodes
  – In addition to Red Hat Linux and the LCFG server software, the frontend machine must run the following services:
    • A DHCP server – containing the node Ethernet addresses
    • An NFS server – to export the installation root directory
    • A Web server – the nodes will use it to fetch their configuration profiles
    • A DNS server – to resolve node names given their IP addresses
    • A TFTP server – if we want to use PXE to boot the nodes
  – Finally we need the list of packages and the LCFG profile file, which will be used to create individual XML profiles containing configuration parameters


WP4 Project - Installation

• Node installation
  – For each node, we need to add the following information on the frontend:
    • The node's Ethernet address in the DHCP configuration file
    • Update the DNS server with the IP address and host name
    • Modify the LCFG server configuration file with the profile of the node
    • Invoke mkxprof to update the XML file
  – Boot the node with a boot disk. The root file system will be mounted by NFS from the frontend
  – Finally the node will be installed automatically


WP4 project - Configuration

• The configuration of the cluster is managed by LCFG
• The configuration is described in a set of source files, held on the frontend machine and written in a high-level description language

• The source files describe global aspects of the cluster configuration which will be used as profiles for the cluster nodes

• These source files must be validated and translated into node profile files (with the rdxprof program)

• The profile files use XML and are published using the frontend web server

• The cluster nodes fetch their XML profile files via HTTP and reconfigure themselves using a set of component scripts


WP4 Project - Monitoring

• Monitoring is done with the aid of a set of Monitoring Sensors (MS) that run on cluster nodes

• The information generated by the MS is collected at each node by a Monitoring Sensor Agent

• The Sensor Agents send the information to a central monitoring repository on the frontend

• A process called the Monitoring Repository Collector (MRC) on the frontend gathers the information

• With a web-based interface we can query any metric for any node


WP4 Project - Maintenance

• With the LCFG tools we can install, remove or upgrade software by modifying the RPM package list
  – Remove the entry from the package list and call updaterpms to update the nodes

• With the LCFG tools we can add, remove or modify users and groups on the nodes

• Modify the node's LCFG profile file and propagate the change to the nodes with the rdxprof program

• A Fault Tolerance Engine is responsible for detecting critical states on the nodes and deciding whether it is necessary to dispatch any action

• An Actuator Dispatcher receives orders to start Fault Tolerance Actuators


WP4 Project - Discussion

• The LCFG software provides a good solution for the installation and configuration of a cluster
  – With the package files we can control the software to install on the cluster nodes
  – With the source files we have full control over the configuration of the nodes

• The design of the monitoring part is very flexible
  – We can implement our own monitoring sensor agents, and we can monitor remote machines (for example, network switches)


WP 4 - Discussion

Other good points

• LCFG supports heterogeneous hardware clusters

• We can have multiple configuration profiles defined in the same cluster

• We can define alarms when something goes wrong, and trigger actions to solve these problems


WP4 - Discussion

Bad points

• It is under development – the components are not very mature

• You need a DNS server or, alternatively, to modify the /etc/hosts file on the frontend by hand

• There is no automatic way to configure the frontend DHCP server

• The web based monitoring interface is rather basic


Other tools


Scyld Beowulf

• Scyld was founded by Donald Becker
• The distribution is released as Free / Open Source (GPL'd)
• Built upon Red Hat 6.2, with the difficult work of customizing it for use as a Beowulf already accomplished
• Represents "second generation Beowulf software"
  – Instead of having full Linux installs on each machine, only one machine, the master node, contains the full install
  – Each slave node has a very small boot image with just enough resources to download the real "slave" operating system from the master
  – Vastly simplifies software maintenance and upgrades


Other Tools

• OpenMosix – Multicomputer Operating System for Unix
  – Clustering software that allows a set of Linux computers to work like a single system
  – Applications do not need MPI or PVM
  – No need for a batch system
  – It is still under development

• SCore – a high performance parallel programming environment for workstations and PC clusters

• Cplant – Computational Plant, developed at Sandia National Laboratories with the aim of building scalable clusters of COTS components


Comparison of Tools

• NPACI Rocks is the easiest solution for the installation and management of a cluster under Linux

• OSCAR's solutions are more flexible and complete than those provided by Rocks, but installation is more difficult

• The WP4 project offers the most flexible, complete and powerful solution of the analyzed packages

• However, WP4 is also the most difficult solution to understand, install and use


HPC Applications and Parallel Programming


HPC Applications Requirements

• Scientific and Engineering Components
  – Physics, Biology, CFD, Astrophysics, Computer Science and Engineering, etc.
• Numerical Algorithm Components
  – Finite Difference / Finite Volume / Finite Element methods, etc.
  – Dense Matrix Algorithms
  – Solving linear systems of equations
  – Solving sparse systems of equations
  – Fast Fourier Transformations


HPC Applications Requirements

• Non-Numerical Algorithm Components
  – Graph algorithms
  – Sorting algorithms
  – Search algorithms for discrete optimization
  – Dynamic programming
• Different Computational Components
  – Parallelism (MPI, PVM, OpenMP, Pthreads, F90, etc.)
  – Architecture efficiency (SMP, clusters, vector, DSM, etc.)
  – I/O bottlenecks (simulations generate gigabytes of data)
  – High performance storage (high I/O throughput from disks)
  – Visualization of all the data that comes out


Future of Scientific Computing

• Requires large-scale simulations, beyond the reach of any single machine

• Requires large geo-distributed cross-disciplinary collaborations

• Systems are getting larger by 2x, 3x, 4x per year!
  – Increasing parallelism: add more and more processors

• A new kind of parallelism: the GRID
  – Harness the power of computing resources, which keep growing


HPC Applications Issues

• Architectures and Programming Models
  – Distributed memory systems (MPP, clusters) – message passing
  – Shared memory systems (SMP) – shared memory programming
  – Specialized architectures – vector processing, data parallel programming
  – The computational Grid – Grid programming
• Applications I/O
  – Parallel I/O
  – Need for high performance I/O systems and techniques, scientific data libraries, and standard data representation
• Checkpointing and Recovery
• Monitoring and Steering
• Visualization (Remote Visualization)
• Programming Frameworks


Important Issues in Parallel Programming

• Partitioning of data
• Mapping of data onto the processors
• Reproducibility of results
• Synchronization
• Scalability and predictability of performance


Designing Parallel Algorithms

• Detect and exploit any inherent parallelism in an existing sequential algorithm

• Invent a new parallel algorithm

• Adopt another parallel algorithm that solves a similar problem


Principles of Parallel Algorithms and Design

Questions to be answered:

• How do we partition the data?
• Which data is going to be partitioned?
• How many types of concurrency are there?
• What are the key principles of designing parallel algorithms?
• What are the overheads in the algorithm design?
• How is the mapping done effectively to balance the load?


Serial and Parallel Algorithms - Evaluation

• Serial Algorithm

– Execution time as a function of size of input

• Parallel Algorithm

– Execution time as a function of input size, parallel architecture and number of processors used

Parallel System

A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented


Success depends on the combination of:

• Architecture, compiler, choice of the right algorithm, programming language

• Design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation


Parallel Programming Paradigm

Phase parallel

Divide and conquer

Pipeline

Process farm

Work pool

Remark: The parallel program consists of a number of supersteps, and each superstep has two phases: a computation phase and an interaction phase.


Phase Parallel Model

[Diagram: alternating rows of independent computations C separated by synchronous interaction phases]

The phase-parallel model offers a paradigm that is widely used in parallel programming.

The parallel program consists of a number of supersteps, and each has two phases.

In a computation phase, multiple processes each perform an independent computation C.

In the subsequent interaction phase, the processes perform one or more synchronous interaction operations, such as a barrier or a blocking communication.

Then the next superstep is executed.
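
A minimal MPI sketch of this superstep structure in C (the numbers computed here are placeholders; only the compute-then-synchronize shape matters):

/* Phase-parallel (superstep) skeleton: local computation followed by a
 * synchronous collective interaction, repeated for each superstep. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0, global = 0.0;
    for (int step = 0; step < 4; step++) {
        local = local * 0.5 + step;                  /* computation phase C   */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);      /* synchronous interaction */
        if (rank == 0)
            printf("superstep %d: global = %f\n", step, global);
    }
    MPI_Finalize();
    return 0;
}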


Divide and Conquer

A parent process divides its workload into several smaller pieces and assigns them to a number of child processes.

The child processes then compute their workload in parallel and the results are merged by the parent.

The dividing and the merging procedures are done recursively.

This paradigm is very natural for computations such as quick sort. Its disadvantage is the difficulty in achieving good load balance.


Pipeline

[Diagram: a data stream flowing through pipeline stages P, Q and R]

In pipeline paradigm, a number of processes form a virtual pipeline.

A continuous data stream is fed into the pipeline, and the processes execute at different pipeline stages simultaneously in an overlapped fashion.
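
A small MPI sketch of the same idea in C, assuming at least two processes: each rank acts as one pipeline stage, and items flow from rank 0 to the last rank.

/* Pipeline paradigm: each rank receives from its predecessor, does its
 * stage of work, and forwards the item to its successor. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, nitems = 8;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < nitems; i++) {
        double item;
        if (rank == 0)
            item = (double)i;                    /* feed the data stream */
        else
            MPI_Recv(&item, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        item = item * 2.0 + rank;                /* this stage's work (placeholder) */

        if (rank < size - 1)
            MPI_Send(&item, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d leaves the pipeline as %f\n", i, item);
    }
    MPI_Finalize();
    return 0;
}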


Process Farm

[Diagram: a master process distributing a data stream to several slave processes]

This paradigm is also known as the master-slave paradigm.

A master process executes the essentially sequential part of the parallel program and spawns a number of slave processes to execute the parallel workload.

When a slave finishes its workload, it informs the master which assigns a new workload to the slave.

This is a very simple paradigm, where the coordination is done by the master.
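
A compact C/MPI sketch of this master-slave structure (the work units and the squaring "workload" are placeholders; it assumes at least two processes):

/* Process-farm (master-slave) paradigm: the master hands out work units
 * one at a time; each slave asks for more work when it finishes. */
#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, size, ntasks = 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                              /* master */
        int next = 0, active = size - 1;
        double result;
        MPI_Status st;
        while (active > 0) {
            /* a slave reports in (first time, or with a finished result) */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                      /* slave */
        int task;
        double result = 0.0;
        MPI_Status st;
        MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            result = (double)task * task;         /* the parallel workload */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}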


Work Pool

[Diagram: several processes P fetching work items from a shared work pool]

This paradigm is often used in a shared variable model.

A pool of work items is maintained in a global data structure.

A number of processes are created. Initially, there may be just one piece of work in the pool.

Any free process fetches a piece of work from the pool and executes it, producing zero, one, or more new work pieces put into the pool.

The parallel program ends when the work pool becomes empty.

This paradigm facilitates load balancing, as the workload is dynamically allocated to free processes.


Parallel Programming Models

Implicit parallelism

The programmer does not explicitly specify parallelism, but lets the compiler and the run-time support system exploit it automatically.

Explicit Parallelism

Parallelism is explicitly specified in the source code by the programmer using special language constructs, compiler directives, or library calls.


Explicit Parallel Programming Models

Three dominant parallel programming models are :

Data-parallel model

Message-passing model

Shared-variable model


Explicit Parallel Programming Models

Message – Passing

Message passing has the following characteristics :

– Multithreading

– Asynchronous parallelism (MPI reduce)

– Separate address spaces (Interaction by MPI/PVM)

– Explicit interaction

– Explicit allocation by user
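
A minimal sketch of these characteristics in C with MPI: two processes with separate address spaces that interact only through explicit send and receive calls.

/* Separate address spaces + explicit interaction: the value exists in
 * rank 1's memory only after the explicit MPI_Send/MPI_Recv pair. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit interaction */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}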


Explicit Parallel Programming Models

Message – Passing

• Programs are multithreaded and asynchronous, requiring explicit synchronization

• More flexible than the data-parallel model, but it still lacks support for the work pool paradigm

• PVM and MPI can be used

• Message passing programs exploit large-grain parallelism


Basic Communication Operations

One-to-All Broadcast

One-to-All Personalized Communication

All-to-All Broadcast

All-to-All personalized Communication

Circular Shift

Reduction

Prefix Sum


One-to-all broadcast on an eight-processor tree

Basic Communication Operations
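
In MPI these operations correspond directly to the collective calls MPI_Bcast, MPI_Scatter, MPI_Allgather, MPI_Alltoall, MPI_Sendrecv (for a circular shift), MPI_Reduce and MPI_Scan. A one-to-all broadcast such as the one in the figure, for example, reduces to a single call; the implementation typically uses a tree of messages internally.

/* One-to-all broadcast expressed with the corresponding MPI collective. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {0, 0, 0, 0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* the root owns the data */
        for (int i = 0; i < 4; i++) data[i] = i + 1;
    }
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d now has %d %d %d %d\n",
           rank, data[0], data[1], data[2], data[3]);
    MPI_Finalize();
    return 0;
}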


Applications I/O and Parallel File Systems

• Large-scale parallel applications generate data in the terabyte range, which cannot be managed efficiently by traditional serial I/O schemes (the I/O bottleneck) – design your applications with parallel I/O in mind!

• Data files should be interchangeable in a heterogeneous computing environment

• Make use of existing tools for data postprocessing and visualization

• Efficient support for checkpointing/recovery

There is a need for high performance I/O systems and techniques, scientific data libraries, and standard data representations.

[Diagram: the parallel I/O software stack – applications; data models and formats; scientific data libraries; parallel I/O libraries; parallel filesystems; hardware/storage subsystem]
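
One concrete option on a cluster is MPI-IO on top of a parallel file system such as PVFS: every process writes its own block of a shared file with a single collective call. A minimal C sketch (the file name and block size are arbitrary choices for illustration):

/* Parallel I/O with MPI-IO: each rank writes one contiguous block of a
 * shared file at an offset derived from its rank. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double local[1024];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1024; i++) local[i] = rank;   /* this process's data */

    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)sizeof(local);
    MPI_File_write_at_all(fh, offset, local, 1024, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}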


Performance Metrics of Parallel Systems

Speedup metrics: three performance models based on three speedup metrics are commonly used.

• Amdahl's law – fixed problem size (fixed-size speedup)
• Gustafson's law – fixed-time speedup
• Sun-Ni's law – memory-bounded speedup

Three approaches to scalability analysis are based on maintaining:

• a constant efficiency,
• a constant speed, and
• a constant utilization
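
The first two speedup laws can be stated compactly. With serial fraction \(\alpha\) and \(n\) processors (a standard formulation, added here for reference):

% Fixed-size (Amdahl) and fixed-time (Gustafson) speedup
\[
  S_{\text{fixed-size}}(n) = \frac{1}{\alpha + \dfrac{1-\alpha}{n}},
  \qquad
  S_{\text{fixed-time}}(n) = \alpha + (1-\alpha)\,n .
\]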


Conclusions

Success depends on the combination of:

• Architecture, compiler, choice of the right algorithm, programming language

• Design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation

Clusters are promising:

• They solve the parallel processing paradox
• They offer incremental growth and match funding patterns
• New trends in hardware and software technologies are likely to make clusters even more promising

