Dimensioning Data Centre Resiliency
HPC Advisory Council Switzerland Conference 2015
Sadaf Alam, CSCS
March 23, 2015
Extended Welcome
New and returning attendees, presenters & sponsors
From interconnects for systems to interconnected systems
Continuous adaptation of the program to reflect community needs
History of HPC Advisory Council Events, Lugano, Switzerland (source: hpcadvisorycouncil.com)
March 15-17, 2010
March 21-23, 2011
March 13-15, 2012
March 13-15, 2013
March 31 - April 3, 2014
HPC Advisory Council Switzerland 2015
Swiss National Supercomputing Centre (CSCS)
Provides, develops, and promotes technical and scientific services for the Swiss research community in HPC (cscs.ch)
Highlights since 2010 (1st Swiss HPC Advisory Council Event)
High Performance High Productivity (HP2C) project
Introduction to hybrid CPU & GPU systems (clusters & Cray XK7)
Move to new building in Lugano from Manno
Launch of data and storage services
Piz Daint: hybrid Cray XC30
All hybrid CPU and GPU systems GPUDirect enabled
Fully non-blocking IB FDR cluster
Extension of Piz Daint with Cray XC40 (Piz Dora) & consolidation of services …
Acknowledgements
Several members of staff at CSCS, specifically:
Network (Chris Gamboni)
Storage (Roberto Aielli, Stefano Gorini)
Regression suite (Tim Robinson, Gabriella Ceci)
User lab (Maria Grazia Giuffreda)
CSCS on call service (Carmelo Ponti)
Review (Nick Cardo)
Please do not hesitate to contact us for details
Provisioning & Consolidation of Services at CSCS
High performance computing services
Piz Daint: hybrid Cray XC30; Piz Dora: Cray XC40 (User Lab, UZH and EPFL MARVEL)
Data analysis and visualization services
Visualization on Piz Daint; large memory nodes on Piz Dora
Storage and data services
Site-wide GPFS for the User Lab projects; additional GPFS and offline storage for customers
Services for customers and partners
MeteoSwiss Cray XE6 system; Swiss Tier-2 Infrastructure for LHC community; fully non-blocking FDR cluster for the PASC community; EPFL Blue Brain 4: IBM Blue Gene/Q & additional resources; services for the Hilti customers
Resiliency
Resilience according to merriam-webster.com
the ability to become strong, healthy, or successful again after something bad happens
the ability of something to return to its original shape after it has been pulled, stretched, pressed, bent, etc.
Terminology in a dynamic HPC data center environment
Bad = failure or degradation of facility OR hardware OR software
Pulled = cabling incidents, user-induced errors
Stretched = power issues esp. in GPU systems, extensions
Pressed = oversubscription in a shared environment
Bent = repurposed resources, mandatory patches
… = constantly evolving requirements due to service consolidation in a shared environment
Typical Solutions for Resiliency
Redundancy: two or more Piz Daint?
Failover: high availability solutions, for example, MeteoSwiss
Fault tolerance: component vs. vertically integrated stacks
On call services: monitoring and interventions
Service migration: many dependencies
Cost effectiveness of different solutions is a key driving factor
CSCS On Call Setup for Systems with Criticality Levels
Red
Defined by Service Level and Contractual elements
Highest availability requirements
Requirement for an operator
Orange
Central services
Systems with specific production requirements
Moderate coverage during on-call period
Green
No specific requirements for high availability
Mainly R&D systems
No coverage during on-call period
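The three criticality levels can be written down as a small lookup table. This is a minimal sketch, not the CSCS implementation: the class names, field names and coverage labels are hypothetical and only restate what the slide says.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    RED = "red"
    ORANGE = "orange"
    GREEN = "green"


@dataclass(frozen=True)
class OnCallPolicy:
    scope: str              # which systems fall into this level (from the slide)
    on_call_coverage: str   # hypothetical label for coverage during the on-call period
    operator_required: bool


# The wording restates the slide; the structure itself is an assumption.
POLICIES = {
    Criticality.RED: OnCallPolicy(
        scope="defined by service level and contractual elements; highest availability",
        on_call_coverage="operator",
        operator_required=True,
    ),
    Criticality.ORANGE: OnCallPolicy(
        scope="central services; systems with specific production requirements",
        on_call_coverage="moderate",
        operator_required=False,
    ),
    Criticality.GREEN: OnCallPolicy(
        scope="mainly R&D systems; no specific high availability requirements",
        on_call_coverage="none",
        operator_required=False,
    ),
}


def covered_during_on_call(level: Criticality) -> bool:
    """True if the level receives any coverage during the on-call period."""
    return POLICIES[level].on_call_coverage != "none"
```

A monitoring alert could consult `covered_during_on_call` before paging anyone, so that green systems wait for working hours.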
Dimensions of Data Center Resiliency
Perspective
Users
Operational staff
Stakeholders
Hierarchical building blocks
Holistic approach = perspective + hierarchical building blocks
Resilient services
Users
Stakeholders
Operational staff
Facility & Infrastructure
Central services and systems (e.g. network, authentication, central database, etc.)
Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)
Computing systems (e.g. Cray and clusters)
Programming and execution environment
User applications
Rate of change
Resiliency Dimensions @ CSCS (Perspectives)
Resilient services
Users
Stakeholders
Operational Staff
[Figure: User Lab publications per year, 2010 to 2014 (vertical axis from 0 to 300)]
HPC Advisory Council & Resiliency—[an unscientific] Survey
2009 (China workshop)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, GPU
2010 (Switzerland, ISC’10 & SC10)
Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI
2011 (Switzerland, China, USA/Stanford, ISC’11, SC11)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU
2012 (Israel, Switzerland, China, Spain, ISC’12)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, security
2013 (USA/Stanford, Switzerland, Spain, China, ISC’13)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data
2014 (USA/Stanford, Switzerland, Brazil, Spain, Singapore, China, South Africa, ISC’14)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data
2015 (USA/Stanford, Switzerland …)
Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI, GPU, MIC, Big Data, [Resiliency/Data Center resiliency]
InfiniBand Infrastructure @ CSCS
Network Setup @ CSCS
Site-wide GPFS Storage Configuration @ CSCS
Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA.
[Figure labels: TSM; VNX8K01; VNX8K02; RAMSAN; Metadata; Data; Backup; Monch CNFS; General CNFS; GPFS clients; Monch; Various systems; FS840]
GPFS can handle node, disk and communication failures*
* Talk to Stefano Gorini for details
Bottom Up vs. Top Down Solutions
Bottom up (customized)
Regression testing tools
Monitors, log file analysis and alerts
Top down
Standards driven
Community effort proposal: Robust505
Holistic approach = bottom up + top down
ANSI/TIA-942 standard
Source: http://www.colocationamerica.com/data-center/tier-standards-overview.htm
CSCS Regression Suite Design Goals
Contract between customers & service providers
Facility & Infrastructure
Central services and systems (e.g. network, authentication, central database, etc.)
Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)
Computing systems (e.g. Cray and clusters)
Programming and execution environment
User applications
[Figure labels: Fine-grain regression tests; Customer; Service provider; CSCS regression suite for Piz Dora & Piz Daint; Rate of change]
CSCS Regression Suite Workflow
Regression driver
Admin checks
Machine load, FS, network, GOM, node status, file systems, SLURM status, disk status… TESTID: 1000
Application checks
Basic compilation and library linking + modules env TESTID: 5[0-6][0-6][0-6][0-9]
Compile and run libsci_acc apps TESTID: 600[0-5]
Compile and run basic CUDA tools TESTID: 7[0-7]0[0-3]
Compile CUDA tools on a specified node TESTID: 8[0-7]00
Run scientific applications TESTID: 900[1-4]
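The TESTID ranges on this slide can be used to map a test ID back to its check category. The sketch below assumes the bracket notation reads as regular-expression character classes. The category labels are descriptive only, and `classify_test_id` is a hypothetical helper rather than part of the CSCS suite.

```python
import re

# TESTID patterns copied from the workflow slide. Reading them as regular
# expressions is an assumption; the category names are descriptive labels.
TEST_CATEGORIES = [
    ("admin checks", r"1000"),
    ("basic compilation, library linking and modules env", r"5[0-6][0-6][0-6][0-9]"),
    ("compile and run libsci_acc apps", r"600[0-5]"),
    ("compile and run basic CUDA tools", r"7[0-7]0[0-3]"),
    ("compile CUDA tools on a specified node", r"8[0-7]00"),
    ("run scientific applications", r"900[1-4]"),
]


def classify_test_id(test_id: str) -> str:
    """Return the category whose pattern matches the whole test ID."""
    for name, pattern in TEST_CATEGORIES:
        if re.fullmatch(pattern, test_id):
            return name
    return "unknown"
```

For example, `classify_test_id("7203")` falls under the basic CUDA tools checks, while an ID outside every range is reported as `"unknown"`.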
Custom options
-m: To run after system downtime (planned or unplanned). Requires a reservation to check all nodes that are up. Runs Admin + App checks.
-p: To run after installation of the monthly PE release and/or any patches obtained from Cray. It tests the development environment.
-c: To run a special suite of tests on the node suspected to be in a bad state.
-a: To run a special suite of production applications to check for performance regressions. Applications are run concurrently, to simulate a production environment. Performance is measured.
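The four options above could be wired into a driver along the following lines. This is only a sketch under stated assumptions: the suite names are hypothetical labels, `-c` is assumed to take the suspect node name as its argument, and the options are assumed to be mutually exclusive. None of this is confirmed by the slide.

```python
import argparse
from typing import List, Optional, Tuple

# Suites selected by each flag, following the option descriptions above.
# The suite names are hypothetical labels, not the real suite's identifiers.
SUITES_BY_FLAG = {
    "m": ["admin", "application"],      # after downtime: Admin + App checks
    "p": ["development-environment"],   # after a PE release or Cray patches
    "c": ["node-health"],               # node suspected to be in a bad state
    "a": ["production-performance"],    # production applications, run concurrently
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="regression-driver",
        description="Sketch of the regression suite options",
    )
    # Assumption: exactly one mode is chosen per run.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-m", action="store_true", help="run after system downtime")
    group.add_argument("-p", action="store_true", help="run after a PE release or patches")
    group.add_argument("-c", metavar="NODE", help="run node checks on a suspect node")
    group.add_argument("-a", action="store_true", help="run production applications")
    return parser


def select_suites(argv: List[str]) -> Tuple[List[str], Optional[str]]:
    """Return the suites to run and, for -c, the node to run them on."""
    args = build_parser().parse_args(argv)
    if args.m:
        return SUITES_BY_FLAG["m"], None
    if args.p:
        return SUITES_BY_FLAG["p"], None
    if args.c is not None:
        return SUITES_BY_FLAG["c"], args.c
    return SUITES_BY_FLAG["a"], None
```

Called as `select_suites(["-c", "nid00012"])`, where the node name is made up, the sketch returns the node-health suite together with the node to test.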
Sample Output (I)
Running admin checks (PASSED)
User who submitted the regression
Start date
Login node where the regression suite has been launched
Location where the output and the log are stored
End date
Test passed
Sample Output (II)
Possible next steps depend on the types of failed test(s); the decision to resume services depends on the criticality of the failure(s)
LogStash: An Open Source Log Management Tool (http://logstash.net)
System Alerts & Monitoring: homegrown examples
Network snapshot (switches, link info, subnet manager)
Change logs and alerts
Error monitoring, for instance, symbol errors indicate cable issues
Pros: effective and instantaneous detection and possible correction of errors
Cons: level of abstraction, for example, GPFS may have an issue while network logs are clean
Similar & more features likely to be available in the Unified Fabric Manager
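Error monitoring of this kind can be sketched as a comparison of two counter snapshots. The example below flags ports whose symbol error count has grown, since growth points to a possible cable issue. The snapshot format (a dict keyed by switch name and port number) and the function name are assumptions; how the counters are collected is out of scope.

```python
from typing import Dict, List, Tuple

# (switch name, port number); this keying scheme is hypothetical.
PortId = Tuple[str, int]


def ports_with_new_symbol_errors(
    previous: Dict[PortId, int],
    current: Dict[PortId, int],
    threshold: int = 1,
) -> List[Tuple[PortId, int]]:
    """Return (port, growth) pairs for ports whose symbol error counter grew
    by at least `threshold` between two snapshots, largest growth first.

    Ports missing from the previous snapshot are treated as starting at zero.
    A counter that went down (for example after a reset) is not flagged.
    """
    flagged = []
    for port, count in current.items():
        delta = count - previous.get(port, 0)
        if delta >= threshold:
            flagged.append((port, delta))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

As the "Cons" line notes, a clean result here says nothing about higher layers: GPFS may still have an issue while every port counter is stable.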
Holistic Approach to Resiliency
HPC-focused list of systems to incentivize robust design solutions
Top500 + 5 best practices = Robust505
Proposed Guidelines
Zero to minimum overhead for making a submission
Metrics (TBD):
Data collection and reporting for Top500 runs
Uptime
Failures classification (known vs. unknown)
Self-healing vs. intervention, i.e. unscheduled maintenance
Known errors database (KEDB) and/or checklist
Faster workaround & resumption of service to users
Knowledge sharing
Ganglia and Nagios integration and completeness (main system, ecosystem, file system)
Best practices from other service providers, e.g. cloud
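Since the metrics are still TBD, the following is only one possible reading of three of them: uptime, known vs. unknown failures, and self-healing vs. intervention. The `Failure` record, its fields and both functions are hypothetical, and the uptime formula assumes that failure downtimes do not overlap.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Failure:
    downtime_hours: float
    known: bool        # matches an entry in the known errors database (KEDB)
    self_healed: bool  # recovered without unscheduled maintenance


def uptime_fraction(period_hours: float, failures: Iterable[Failure]) -> float:
    """Fraction of the reporting period the system was up.

    Assumes downtimes do not overlap; the result is clamped at zero.
    """
    down = sum(f.downtime_hours for f in failures)
    return max(0.0, 1.0 - down / period_hours)


def classify(failures: Iterable[Failure]) -> Counter:
    """Count failures as known/unknown and as self-healed/intervention."""
    counts = Counter()
    for f in failures:
        counts["known" if f.known else "unknown"] += 1
        counts["self-healed" if f.self_healed else "intervention"] += 1
    return counts
```

A submission could then report the uptime fraction next to the two breakdowns, which keeps the overhead of participating low.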
Leveraging Other Approaches & Solutions
https://www.openstack.org
Leveraging Other Approaches & Solutions
Horizon: dashboard & portal
Nova: cloud compute
Neutron: cloud network
Cinder: block storage
Keystone: identity management
Glance: image management
REST (REpresentational State Transfer) guidelines & best practices for creating scalable web services
Resilience for OpenStack Deployments
High Availability (HA) presentations and papers on OpenStack (mainly at the OpenStack Summits)
"Resiliency and Performance Engineering for OpenStack at Enterprise Scale" by Mirantis
Checklist ~ regression
Wrecking crew
"More Reliable, More Resilient, More Redundant: High Availability Update for Grizzly and Beyond" by Hastexo
Layer-by-layer & component-by-component analysis ~ hierarchical approach & customer & service provider model for regression
NovaResiliency
Component-specific draft
Ubuntu OpenStackHA guideline
...
"Everything fails, all the time."
Werner Vogels, VP & CTO, Amazon.com
Key Considerations and Next Steps
Commissioning resilient services at a data center
Cultural issues
Overhead or productive work by users & operational staff
Focus on documentation and knowledge sharing
Putting best practices to work, e.g. change management tools
Investments & cost effectiveness
Staffing considerations
Metrics for cost effectiveness and SLAs, TBC by stakeholders
Community driven efforts for HPC data center resiliency
Share your stories (happy endings plus horror)
Feedback on Robust505
Leadership by the HPC Advisory Council with network as a focal point
Thank you for your attention.