
TechEd 2002 © 2002 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. 1

CSC469/585: Winter 2011-12
High Availability and Performance Computing: Towards non-stop services in HPC/HEC/Enterprise IT Environments

Chokchai (Box) Leangsuksun, Associate Professor, Computer Science
Director, eXtreme Computing Research Group
Center for Entrepreneurship and Information Technology
Louisiana Tech University

12 December 2011

Box's 1 minute Bio
• B. Eng (AE 1983): Khon Kaen University
• MS and PhD in CS (1995):
  - MS Thesis: Parallel C compiler (1989)
  - PhD Thesis: Resource management/allocation in Heterogeneous Parallel Distributed Computing (1995)
• 7 years in industry (Lucent)
  - Highly Reliable Software/system: R&D -> 4 major network management products
  - Architect, PM, Tech lead (15-30 team size)
• SWEPCO Endowed Professor since 2007, Best Teaching Award 2003
• Louisiana Hero, www.searchKatrina.org
• Associate Professor in CS since 2002
  - Collaborations with national and industry labs (e.g. ORNL, Intel, Ericsson, NCSA, Dell, etc.)
  - Funded research projects by DOD, NSF, DOE
• Research Interest
  - Resilience in HPC, Cluster computing, Fault Tolerance, Reliability, Availability and Serviceability (RAS) in HPC
• Services
  - Various Program committees: IEEE Cluster computing, Grid computing education
  - Co-founder and chair: High Availability and Performance Computing Workshop

Outline
• Background and Motivation
• Current R&D & Educational projects
  - HA-OSCAR Architecture, infrastructure and System management
  - Design & Dependability Analysis
  - Education in HAPC
• Conclusion

High Performance Computing
• HPC: "Hardware and Software techniques devised, for building computer systems to quickly perform large amounts of computation in the shortest possible time"
• HPC is not the same as high throughput

HPC goes mainstream
• Multicore is available everywhere
• GPU is hot!!!
• At SC|05, Bill Gates gave a keynote as HPC goes mainstream
• Microsoft is in HPC clusters (Windows)
• More critical applications require HPC
• Your mobiles and tablets are multicore + GPU

How to achieve HPC
• Work hard: add more powerful unit(s).
  - Faster CPU
  - More CPUs, parallel architecture
  - Faster connectivity
• Work smart: better algorithms to take advantage of parallelism
  - Multiprogramming / multiprocessing (Unix fork)
  - Multi-threading
  - Parallel programming (MPI, OpenMP, PVM)

Parallel Architectures
Hardware Architectures for HPC and their Parallel Programming Models
• Distributed Memory Systems, MPP, Clusters - Message Passing
• Shared Memory Systems (SMP) - Shared Memory Programming
• Multicore/Manycores
• Specialized Architectures, Vector Processing - Data Parallel Programming (SIMD)
• The Grid - Grid Computing

TOP500.org
[Figures: a series of charts shown over several slides. The charts are from top500.org.]

Example of HPC app
[Figures: two pages excerpted from David Klepacki's presentation.]

Production HPC system in the real world.

Availability density of each node in the ASC White (formerly one of the top HPC systems)
[Figure: "Availability for White". x-axis: node index; y-axis: availability; average = 0.9872, STDEV = 0.03292.]
• The average availability is 0.9872, with a standard deviation of 0.033.
• The availability of most nodes is above 0.95, with a few below 0.8.
• This indicates that some nodes manifest outages more often than others. If the runtime systems are not aware of these nodes' unreliability, the result may be low total system performance, extended application completion time, or failure.

Nodes MTTF density (in hours)
[Figure: "Nodes MTTF for White". x-axis: node index; y-axis: MTTF in hours; mean = 3923, STDEV = 1217.]
• The average is 3923 hours, with a standard deviation of 1217.
• The maximum is 5592 hours.
• The minimum is 230 hours.

Nodes Downtime (in hours)
[Figure: "Nodes TDT for White". x-axis: node index; y-axis: total downtime (TDT) in hours; mean = 355, STDEV = 56.]
• The average is 355 hours, with a standard deviation of 56.
• Most of the total down time for each node is around 100 hours.
• Some failure events cost more time due to a prolonged repair process, and thus increase the average TDT.

Reliability Differences on the same HW
• "AND Survivability" analysis, based on:
  - 10, 100, or 1000 nodes, all of which have to survive
  - Each node MTTF at 5000 hours
• N=10, MTTF = 492.424242 hours
• N=100, MTTF = 49.9902931 hours
• N=1000, MTTF = 4.99999003 hours
• Reliability and Availability info - Better Job scheduling and execution
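
The "all nodes must survive" analysis above can be approximated in a few lines of Python. This is a minimal sketch that assumes independent nodes with exponentially distributed failure times, an assumption the slide does not state. Under it, the system MTTF is the node MTTF divided by N, which gives 500, 50, and 5 hours: close to the slide's figures but not identical, so the slide presumably used a different method. The function names are illustrative only.

```python
import math


def series_mttf(node_mttf_hours: float, n_nodes: int) -> float:
    """System MTTF when all n_nodes must survive.

    Assumes independent nodes with exponential failure times, so the
    failure rates add up and the system MTTF is node MTTF / N.
    """
    return node_mttf_hours / n_nodes


def series_reliability(node_mttf_hours: float, n_nodes: int,
                       mission_hours: float) -> float:
    """Probability that every node survives a job of mission_hours."""
    return math.exp(-n_nodes * mission_hours / node_mttf_hours)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        mttf = series_mttf(5000.0, n)
        # 24 hours is a hypothetical job length, not a figure from the slide.
        r_24h = series_reliability(5000.0, n, 24.0)
        print(f"N={n}: system MTTF = {mttf:.1f} h, "
              f"P(24 h job sees no failure) = {r_24h:.3f}")
```

A scheduler that knows these numbers can, for example, choose checkpoint intervals or job sizes so that a job is likely to finish before the next expected failure.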

High Performance Buildout
[Diagram labels: HPC buildout for very, very large scale systems; High Serviceability; High Availability.]

Availability
• A measurement that represents the ratio of uptime to total time
• High availability: the ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest.
• High availability is most often achieved through fault tolerance.

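Availability Model
[Figure: timelines over time with the legend "Server up" and "Server down & repair". One panel, labeled "Availability model", shows server S1; the other, labeled "HA-OSCAR dual head model", shows servers S1 and S2 and the combined S1&S2.]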

Availability (continued)
• Availability = uptime / total time
• MTTF = Mean Time To Failure
  - Average time to failure, when it is not repairable
• MTBF = Mean Time Between Failures
  - Average time to failure, when it is repairable
• MTTR = Mean Time To Repair
• Availability = MTTF / (MTTF + MTTR)
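
As a small illustration of the two availability formulas above, here is a minimal Python sketch. The function names and the sample values (an MTTF of 5000 hours with an MTTR of 50 hours, and 8000 hours up out of 8760) are hypothetical inputs chosen for the example, not measurements from the slides.

```python
def availability_from_mttf(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def availability_from_uptime(uptime_hours: float, total_hours: float) -> float:
    """Measured availability = uptime / total time."""
    if total_hours <= 0 or not 0 <= uptime_hours <= total_hours:
        raise ValueError("need 0 <= uptime <= total and total > 0")
    return uptime_hours / total_hours


if __name__ == "__main__":
    # Hypothetical node: fails every 5000 h on average, 50 h to repair.
    print(f"{availability_from_mttf(5000.0, 50.0):.4f}")
    # Hypothetical log: 8000 h up during a 8760 h year.
    print(f"{availability_from_uptime(8000.0, 8760.0):.4f}")
```

Note that shortening repair time (MTTR) raises availability just as lengthening MTTF does, which is why serviceability matters alongside reliability.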

Degree of Availability

System Type              Unavailability (minutes/year)   Availability (in percent)   Availability Class
Unmanaged                50,000                          90                          1
Managed                  5,000                           99                          2
Well-managed             500                             99.9                        3
Fault-tolerant           50                              99.99                       4
High Availability        5                               99.999                      5
Very High Availability   0.5                             99.9999                     6
Ultra Availability       0.05                            99.99999                    7
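
The unavailability column in the table appears to be rounded to one significant figure. The following minimal sketch recomputes the downtime per year for each availability class, assuming a 365-day year (525,600 minutes) and that class k means k nines of availability; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime for an availability of `nines` nines.

    For example nines=3 means 99.9% availability, i.e. an
    unavailability of 10**-3.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for nines in range(1, 8):
        availability_percent = 100.0 * (1.0 - 10.0 ** (-nines))
        minutes = downtime_minutes_per_year(nines)
        print(f"class {nines}: {availability_percent:.5f}% available, "
              f"about {minutes:,.3f} minutes of downtime per year")
```

For class 1 (90%) this gives 52,560 minutes per year, which the table rounds to 50,000; each further nine divides the downtime by ten.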

Quiz: Find the downtime in one year for each number of 9's (using the same Degree of Availability table as above).

Quiz #2
• Say the machine costs $20M
• Availability is 92.1%
• What is the downtime and the cost of downtime?
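
One possible way to approach Quiz #2 is sketched below. The downtime follows directly from the availability (about 692 hours per 365-day year). The cost figure depends on assumptions the slide does not give: this sketch charges only the share of the purchase price that is idle, amortized evenly over a hypothetical 5-year service life, and ignores lost work, staff time, and power. Other cost models would give different answers.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760, assuming a 365-day year


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in one year."""
    return (1.0 - availability) * HOURS_PER_YEAR


def downtime_cost_per_year(machine_cost: float, availability: float,
                           service_life_years: float) -> float:
    """Idle share of the purchase price per year.

    Assumption (not from the slides): the machine cost is amortized
    evenly over service_life_years, and downtime wastes that share.
    """
    yearly_cost = machine_cost / service_life_years
    return yearly_cost * (1.0 - availability)


if __name__ == "__main__":
    hours = downtime_hours_per_year(0.921)
    # 5 years is a hypothetical service life chosen for illustration.
    cost = downtime_cost_per_year(20_000_000.0, 0.921, 5.0)
    print(f"downtime: about {hours:.0f} hours/year")
    print(f"amortized cost of downtime: about ${cost:,.0f}/year")
```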

Unavailability = No performance and functionality
• Availability enables performance/functionality
• Performance-oriented enterprise/shared major computing resources: 24/7/365
• Losses of $195K - $58M with 3.5 hrs of downtime (Meta Group report)
• Service provider regulation/mandate
  - FCC mandate (Class 5 local switch)
• Loss of time and opportunities
• Life-threatening
• National security (homeland defense)

What is involved (non-stop services)?
[Diagram labels: users; Operation; Config/Mgt; Tool; App/Service; HW; Training; ???]

Goals
• Towards non-stop services in HPC/HEC environments
  - High Availability (Reliability)
  - High Serviceability (planned downtime)
  - High Performance Computing (HPC)
  - We want them all

