TechEd 2002 © 2002 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. 1
CSC469/585: Winter 2011-12
High Availability and Performance Computing: Towards non-stop services in HPC/HEC/Enterprise IT Environments
Chokchai (Box) Leangsuksun
Associate Professor, Computer Science
Director, eXtreme Computing Research Group
Center for Entrepreneurship and Information Technology
Louisiana Tech University
12 December 2011
Box’s 1 minute Bio
• B.Eng (AE, 1983): Khon Kaen University
• MS and PhD in CS (1995):
  - MS thesis: Parallel C compiler (1989)
  - PhD thesis: Resource management/allocation in heterogeneous parallel distributed computing (1995)
• 7 years in industry (Lucent)
  - Highly reliable software/systems: R&D -> 4 major network management products
  - Architect, PM, tech lead (team size 15-30)
• SWEPCO Endowed Professor since 2007; Best Teaching Award 2003
• Louisiana Hero, www.searchKatrina.org
• Associate Professor in CS since 2002
  - Collaborations with national and industry labs (e.g., ORNL, Intel, Ericsson, NCSA, Dell)
  - Funded research projects by DOD, NSF, DOE
• Research interests
  - Resilience in HPC, cluster computing, fault tolerance; Reliability, Availability and Serviceability (RAS) in HPC
• Services
  - Various program committees: IEEE Cluster Computing, Grid computing education
  - Co-founder and chair: High Availability and Performance Computing Workshop
Outline
• Background and Motivation
• Current R&D and Educational projects
  - HA-OSCAR architecture, infrastructure and system management
  - Design and dependability analysis
  - Education in HAPC
• Conclusion
High Performance Computing
• HPC: “Hardware and software techniques devised for building computer systems to perform large amounts of computation in the shortest possible time”
• HPC is not the same as high-throughput computing
HPC goes mainstream
• Multicore is available everywhere
• GPUs are hot!
• At SC|05, Bill Gates gave a keynote as Microsoft entered HPC clusters (Windows)
• More critical applications require HPC
• Your mobiles and tablets are multicore + GPU
How to achieve HPC
• Work hard: add more powerful unit(s)
  - Faster CPU
  - More CPUs, parallel architecture
  - Faster connectivity
• Work smart: better algorithms to take advantage of parallelism
  - Multiprogramming / multiprocessing (Unix fork)
  - Multi-threading
  - Parallel programming (MPI, OpenMP, PVM)
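As a minimal sketch of the process-level option (Unix fork), the Python example below splits a sum across child processes and collects the partial results through pipes. The function name `parallel_sum` and the default worker count are illustrative assumptions, not part of the course material, and the code runs only on Unix-like systems.

```python
import os
import struct


def parallel_sum(values, workers=4):
    """Sum a list of integers by forking one child process per chunk (Unix only)."""
    chunk = (len(values) + workers - 1) // workers  # ceiling division
    children = []
    for i in range(workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute a partial sum, send it to the parent, then exit.
            try:
                os.close(read_fd)
                partial = sum(values[i * chunk:(i + 1) * chunk])
                os.write(write_fd, struct.pack("q", partial))
            finally:
                os._exit(0)
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    total = 0
    for pid, read_fd in children:
        total += struct.unpack("q", os.read(read_fd, 8))[0]
        os.close(read_fd)
        os.waitpid(pid, 0)  # reap the child so no zombie is left behind
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))
```

The same decomposition (split the data, compute partial results independently, combine) carries over to threads or to message passing with MPI; only the mechanism for starting workers and exchanging results changes.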
Parallel Architectures
Hardware architectures for HPC and their parallel programming models:
• Distributed memory systems (MPP, clusters): message passing
• Shared memory systems (SMP): shared memory programming
• Multicore/manycore
• Specialized architectures, vector processing: data parallel programming (SIMD)
• The Grid: grid computing
TOP500.org
[Charts from top500.org]
Example of HPC app
Page is excerpted from David Klepacki's presentation
Production HPC system in the real world.
Availability density of each node in ASC White (formerly one of the top HPC systems)
[Figure: Availability for White. x-axis: node index; y-axis: availability. Average = 0.9872, STDEV = 0.03292]
The average availability is 0.987, with a standard deviation of 0.033.
Most nodes have an availability above 0.95; a few are below 0.8.
This indicates that some nodes suffer outages more often than others.
If the runtime system is not aware of these unreliable nodes, the result may be lower total system performance, longer application completion times, or application failure.
Nodes MTTF density (in hours)
[Figure: Nodes MTTF for White. x-axis: node index; y-axis: MTTF in hours. Mean = 3923, STDEV = 1217]
The average MTTF is 3923 hours, with a standard deviation of 1217 hours.
The maximum is 5592 hours.
The minimum is 230 hours.
Nodes Downtime (in hours)
[Figure: Nodes TDT for White. x-axis: node index; y-axis: total downtime (TDT) in hours. Mean = 355, STDEV = 56]
The average total downtime (TDT) is 355 hours, with a standard deviation of 56 hours.
Most nodes have a total downtime of around 100 hours.
Some failure events take longer because of a prolonged repair process, which raises the average TDT.
Reliability Differences on the same HW
• “AND survivability” analysis, based on:
  - Systems of 10, 100, and 1000 nodes, in which all nodes have to survive
  - Each node has an MTTF of 5000 hours
• N=10, MTTF = 492.424242 hours
• N=100, MTTF = 49.9902931 hours
• N=1000, MTTF = 4.99999003 hours
• Reliability and availability information enables better job scheduling and execution
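The trend in these figures can be approximated with a simple series model. The sketch below assumes independent nodes with exponentially distributed lifetimes, so that the system fails as soon as any node fails and the failure rates add. This is an assumed simplification: the slide does not state the model behind its own numbers, which are close to but not identical to the values this approximation gives. The function name `series_mttf` is illustrative.

```python
def series_mttf(node_mttf_hours: float, nodes: int) -> float:
    """Approximate MTTF of a system in which all `nodes` must survive.

    Assumption: independent nodes with exponential lifetimes, so the
    system failure rate is nodes / node_mttf_hours.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return node_mttf_hours / nodes


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"N={n:>4}: system MTTF is roughly {series_mttf(5000, n):.1f} hours")
```

Under this model, a tenfold increase in node count cuts the system MTTF by a factor of ten, which is why the same hardware looks far less reliable at scale.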
High Performance Buildout
HPC buildout for very large-scale systems
High serviceability
High Availability
Availability
• A measure defined as the ratio of uptime to total time
• High availability: the ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest
• High availability is most often achieved through fault tolerance
Availability Model
[Figure: timelines showing "server up" and "server down & repair" periods. Availability model: single server S1. HA-OSCAR dual head model: servers S1 and S2, combined as S1&S2.]
Availability (continued)
• Availability = uptime / total time
• MTTF = Mean Time To Failure
  - Average time to failure of a system that is not repairable
• MTBF = Mean Time Between Failures
  - Average time between successive failures of a system that is repairable
• MTTR = Mean Time To Repair
• Availability = MTTF / (MTTF + MTTR)
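A short sketch of the formula in use follows. The 24-hour MTTR is a hypothetical value chosen for illustration. The second function is an added assumption rather than something the slides state: it models a redundant pair, such as a dual-head configuration, as available whenever either server is up, with independent failures and instantaneous failover.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_pair_availability(a1: float, a2: float) -> float:
    """Availability of two servers when either one can provide the service.

    Assumption: failures are independent and failover is instantaneous,
    so the pair is down only when both servers are down.
    """
    return 1.0 - (1.0 - a1) * (1.0 - a2)


if __name__ == "__main__":
    single = availability(mttf_hours=5000, mttr_hours=24)  # MTTR is hypothetical
    pair = redundant_pair_availability(single, single)
    print(f"single server: {single:.6f}")
    print(f"redundant pair: {pair:.6f}")
```

Real failover takes time and failures are often correlated, so the pair figure is an upper bound under these assumptions.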
Degree of Availability

System Type             Unavailability (minutes/year)  Availability (%)  Availability Class
Unmanaged               50,000                         90                1
Managed                 5,000                          99                2
Well-managed            500                            99.9              3
Fault-tolerant          50                             99.99             4
High Availability       5                              99.999            5
Very High Availability  0.5                            99.9999           6
Ultra Availability      0.05                           99.99999          7
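The unavailability column is rounded to one significant figure. More exact values follow from the 525,600 minutes in a non-leap year, as the minimal sketch below shows; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [90, 99, 99.9, 99.99, 99.999, 99.9999, 99.99999]
    for availability_class, percent in enumerate(levels, start=1):
        minutes = downtime_minutes_per_year(percent)
        print(f"class {availability_class}: {percent}% -> {minutes:,.2f} minutes/year")
```

For example, "five nines" (99.999%) allows a little over five minutes of downtime per year, which the table rounds to 5.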
Quiz: find the downtime in one year for each number of 9s
Quiz #2
• Say the machine costs $20M
• Availability is 92.1%
• What is the downtime and the cost of downtime?
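One way to approach this quiz is sketched below. The downtime follows directly from the availability. The cost depends on how the purchase price is spread over time, which the quiz does not specify, so the five-year straight-line amortization used here is a hypothetical assumption. The estimate also counts only hardware cost, not lost work or staff time.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in hours."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


def downtime_cost_per_year(machine_cost: float,
                           availability_percent: float,
                           lifetime_years: float) -> float:
    """Share of the amortized machine cost lost to downtime each year.

    Assumption: straight-line amortization over `lifetime_years`.
    """
    yearly_cost = machine_cost / lifetime_years
    return yearly_cost * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    hours = downtime_hours_per_year(92.1)
    cost = downtime_cost_per_year(20_000_000, 92.1, lifetime_years=5)  # lifetime is hypothetical
    print(f"downtime: {hours:.1f} hours/year")
    print(f"amortized cost of downtime: ${cost:,.0f} per year")
```

At 92.1% availability the machine is down for roughly 692 hours, or about 29 days, each year.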
Unavailability = No performance and functionality
• Availability enables performance and functionality
• Performance-oriented enterprise/shared major computing resources: 24/7/365
• Losses of $195K to $58M for 3.5 hours of downtime (Meta Group report)
• Service provider regulation/mandate
  - FCC mandate (Class 5 local switch)
• Lost time and opportunities
• Life-threatening
• National security (homeland defense)
What is involved in non-stop services?
users
Operation Config/Mgt
Tool
App/Service HW
Training
???
Goals
• Towards non-stop services in HPC/HEC environments
  - High Availability (Reliability)
  - High Serviceability (planned downtime)
  - High Performance Computing (HPC)
  - We want them all