
Dimensioning Data Centre Resiliency

HPC Advisory Council Switzerland Conference 2015

Sadaf Alam, CSCS

March 23, 2015

Extended Welcome

New and returning attendees, presenters & sponsors

From interconnects for systems to interconnected systems

Continuous adaptation of the program to reflect community needs

History of HPC Advisory Council Events, Lugano, Switzerland (source: hpcadvisorycouncil.com)

March 15-17, 2010; March 21-23, 2011; March 13-15, 2012; March 13-15, 2013; March 31 - April 3, 2014


Swiss National Supercomputing Centre (CSCS)

Provides, develops, and promotes technical and scientific services for the Swiss research community in HPC (cscs.ch)

Highlights since 2010 (1st Swiss HPC Advisory Council Event):

High Performance High Productivity (HP2C) project

Introduction to hybrid CPU & GPU systems (clusters & Cray XK7)

Move to new building in Lugano from Manno

Launch of data and storage services

Piz Daint: hybrid Cray XC30

All hybrid CPU and GPU systems GPUDirect enabled

Fully non-blocking IB FDR cluster

Extension of Piz Daint with Cray XC40 (Piz Dora) & consolidation of services …


Acknowledgements

Several members of staff at CSCS, specifically:

Network (Chris Gamboni)

Storage (Roberto Aielli, Stefano Gorini)

Regression suite (Tim Robinson, Gabriella Ceci)

User lab (Maria Grazia Giuffreda)

CSCS on call service (Carmelo Ponti)

Review (Nick Cardo)

Please do not hesitate to contact us for details


Provisioning & Consolidation of Services at CSCS

High performance computing services

Piz Daint: hybrid Cray XC30

Piz Dora: Cray XC40 (User Lab, UZH and EPFL MARVEL)

Data analysis and visualization services

Visualization on Piz Daint

Large memory nodes on Piz Dora

Storage and data services

Site-wide GPFS for the User Lab projects

Additional GPFS and offline storage for customers

Services for customers and partners

MeteoSwiss Cray XE6 system

Swiss Tier-2 Infrastructure for LHC community

Fully non-blocking FDR cluster for the PASC community

EPFL Blue Brain 4: IBM Blue Gene/Q & additional resources

Services for the Hilti customers


Resiliency

Resilience according to merriam-webster.com

the ability to become strong, healthy, or successful again after something bad happens

the ability of something to return to its original shape after it has been pulled, stretched, pressed, bent, etc.

Terminology in a dynamic HPC data center environment

Bad = failure or degradation of facility OR hardware OR software

Pulled = cabling incidents, user-induced errors

Stretched = power issues esp. in GPU systems, extensions

Pressed = oversubscription in a shared environment

Bent = repurposed resources, mandatory patches

… = constantly evolving requirements due to service consolidation in a shared environment


Typical Solutions for Resiliency

Redundancy: two or more Piz Daint?

Failover: high availability solutions, for example, MeteoSwiss

Fault tolerance: component vs. vertically integrated stacks

On call services: monitoring and interventions

Service migration: many dependencies


Cost effectiveness of different solutions is a key driving factor

CSCS On Call Setup for Systems with Criticality Levels

Red

Defined by Service Level and Contractual elements

Highest availability requirements

Requirement for an operator

Orange

Central services

Systems with specific production requirements

Moderate coverage during on-call period

Green

No specific requirements for high availability

Mainly R&D systems

No coverage during on-call period
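The three criticality levels above can be written down as a small lookup table. This is only a sketch: the system names in `SYSTEMS` are hypothetical placeholders, not actual CSCS assignments, and the `Coverage` values simply restate the wording of the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Coverage(Enum):
    OPERATOR = "operator required"   # Red: "Requirement for an operator"
    MODERATE = "moderate coverage"   # Orange
    NONE = "no coverage"             # Green


@dataclass(frozen=True)
class CriticalityLevel:
    name: str
    description: str
    on_call_coverage: Coverage


LEVELS = {
    "red": CriticalityLevel(
        "red",
        "Defined by service level and contractual elements; "
        "highest availability requirements",
        Coverage.OPERATOR,
    ),
    "orange": CriticalityLevel(
        "orange",
        "Central services; systems with specific production requirements",
        Coverage.MODERATE,
    ),
    "green": CriticalityLevel(
        "green",
        "No specific requirements for high availability; mainly R&D systems",
        Coverage.NONE,
    ),
}

# Hypothetical inventory: these names are placeholders for illustration only.
SYSTEMS = {
    "contract-system": "red",
    "central-auth": "orange",
    "rnd-testbed": "green",
}


def needs_on_call_action(system: str) -> bool:
    """Return True if an incident on `system` should reach on-call staff."""
    level = LEVELS[SYSTEMS[system]]
    return level.on_call_coverage is not Coverage.NONE
```

With such a table, an alerting script could check `needs_on_call_action("rnd-testbed")` before paging anyone during the on-call period.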


Dimensions of Data Center Resiliency

Perspective

Users

Operational staff

Stakeholders

Hierarchical building blocks

Holistic approach = perspective + hierarchical building blocks


Resilient services, seen from three perspectives: users, stakeholders, operational staff

Facility & Infrastructure

Central services and systems (e.g. network, authentication, central database, etc.)

Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)

Computing systems (e.g. Cray and clusters)

Programming and execution environment

User applications

Diagram label: Rate of change

Resiliency Dimensions @ CSCS (Perspectives)

Resilient services, seen from three perspectives: users, stakeholders, operational staff

Chart: User Lab publications per year, 2010 to 2014 (axis from 0 to 300)

HPC Advisory Council & Resiliency—[an unscientific] Survey

2009 (China workshop)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, GPU

2010 (Switzerland, ISC’10 & SC10)

Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI

2011 (Switzerland, China, USA/Stanford, ISC’11, SC11)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU

2012 (Israel, Switzerland, China, Spain, ISC’12)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, security

2013 (USA/Stanford, Switzerland, Spain, China, ISC’13)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data

2014 (USA/Stanford, Switzerland, Brazil, Spain, Singapore, China, South Africa, ISC’14)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data

2015 (USA/Stanford, Switzerland …)

Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI, GPU, MIC, Big Data, [Resiliency/Data Center resiliency]


InfiniBand Infrastructure @ CSCS


Network Setup @ CSCS


Site-wide GPFS Storage Configuration @ CSCS


Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA.

Diagram labels: TSM, VNX8K01, VNX8K02, RAMSAN, FS840, Metadata, Data, Backup, Monch CNFS, General CNFS, GPFS clients, Monch, Various systems

GPFS can handle node, disk and communication failures*

* Talk to Stefano Gorini for details

Bottom Up vs. Top Down Solutions

Bottom up (customized)

Regression testing tools

Monitors, log file analysis and alerts

Top down

Standards driven

Community effort proposal: Robust505

Holistic approach = bottom up + top down


ANSI/TIA-942 standard

Source: http://www.colocationamerica.com/data-center/tier-standards-overview.htm

CSCS Regression Suite Design Goals

Contract between customers & service providers


Facility & Infrastructure

Central services and systems (e.g. network, authentication, central database, etc.)

Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)

Computing systems (e.g. Cray and clusters)

Programming and execution environment

User applications

Diagram labels: Fine-grain regression tests; Customer; Service provider; CSCS regression suite for Piz Dora & Piz Daint; Rate of change


CSCS Regression Suite Workflow

Regression driver

Admin checks

Machine load, FS, network, GOM, node status, file systems, SLURM status, disk status… TESTID: 1000

Application checks

Basic compilation and library linking + modules env TESTID: 5[0-6][0-6][0-6][0-9]

Compile and run libsci_acc apps TESTID: 600[0-5]

Compile and run basic CUDA tools TESTID: 7[0-7]0[0-3]

Compile CUDA tools on a specified node TESTID: 8[0-7]00

Run scientific applications TESTID: 900[1-4]
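The TESTID ranges on this slide read naturally as regular expressions. The sketch below is not the CSCS driver; it only illustrates how a test ID could be mapped back to its category. The function name `classify_test_id` is hypothetical, while the patterns and labels are copied from the slide.

```python
import re

# (pattern, label) pairs taken from the workflow slide.
TEST_ID_PATTERNS = [
    (r"1000", "admin checks"),
    (r"5[0-6][0-6][0-6][0-9]",
     "basic compilation and library linking + modules env"),
    (r"600[0-5]", "compile and run libsci_acc apps"),
    (r"7[0-7]0[0-3]", "compile and run basic CUDA tools"),
    (r"8[0-7]00", "compile CUDA tools on a specified node"),
    (r"900[1-4]", "run scientific applications"),
]


def classify_test_id(test_id: str):
    """Return the category label for a test ID, or None if no range matches."""
    for pattern, label in TEST_ID_PATTERNS:
        # fullmatch ensures the whole ID fits the range, not just a prefix.
        if re.fullmatch(pattern, test_id):
            return label
    return None
```

A failed test reported as `7203` would then resolve to the basic CUDA tools group, which helps decide who should look at it first.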


Custom options

-m: To run after system downtime (planned or unplanned). Requires a reservation to check all nodes that are up. Runs Admin + App checks.

-p: To run after installation of the monthly PE release and/or any patches obtained from Cray. It tests the development environment.

-c: To run a special suite of tests on the node suspected to be in a bad state.

-a: To run a special suite of production applications to check for performance regressions. Applications are run concurrently, to simulate a production environment. Performance is measured.
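A command-line front end for these four options could look roughly like the sketch below. Only the flag letters and their meaning come from the slide. The program name `regression-driver`, the suite names, the assumption that `-c` takes a node name, and the assumption that the options are mutually exclusive are all illustrative guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "regression-driver" is a hypothetical program name.
    parser = argparse.ArgumentParser(prog="regression-driver")
    # Assumption: exactly one mode is selected per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-m", action="store_true",
                      help="after system downtime: admin + app checks")
    mode.add_argument("-p", action="store_true",
                      help="after PE release or patches: development environment")
    # Assumption: the suspect node is passed as the option value.
    mode.add_argument("-c", metavar="NODE",
                      help="special tests on a node suspected to be in a bad state")
    mode.add_argument("-a", action="store_true",
                      help="production applications, run concurrently")
    return parser


def select_suites(args: argparse.Namespace) -> list:
    """Map the selected option to a list of (hypothetical) suite names."""
    if args.m:
        return ["admin", "application"]
    if args.p:
        return ["development-environment"]
    if args.c is not None:
        return [f"node-check:{args.c}"]
    return ["production-performance"]
```

For example, `select_suites(build_parser().parse_args(["-m"]))` selects both the admin and the application checks, matching the "Runs Admin + App checks" note for `-m`.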


Sample Output (I)

Running admin checks (PASSED)

Annotated fields in the output:

User who submitted the regression

Start date

Login node where the regression suite has been launched

Location where the output and the log are stored

End date

Test passed

Sample Output (2)

Possible next steps depend on the types of failed test(s); the decision to resume services depends on the criticality of the failure(s)

LogStash—An Open Source Log Management Tool

(http://logstash.net)


Systems Alerts & Monitoring—homegrown examples

Network snapshot (switches, link info, subnet manager)

Change logs and alerts

Error monitoring, for instance, symbol errors indicate cable issues

Pros: effective, instantaneous detection and possibly correction of errors

Cons: level of abstraction, for example, GPFS may have an issue while network logs are clean


Similar & more features likely to be available in the Unified Fabric Manager
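The symbol error example can be illustrated by comparing two counter snapshots. This is a minimal sketch, not the homegrown CSCS tooling: the snapshot format (a dictionary from port name to symbol error count) and the port names are hypothetical.

```python
def ports_with_new_symbol_errors(previous, current, threshold=0):
    """Return {port: increase} for ports whose symbol error count grew
    by more than `threshold` between two snapshots."""
    flagged = {}
    for port, count in current.items():
        # Ports missing from the earlier snapshot are treated as starting at 0.
        delta = count - previous.get(port, 0)
        if delta > threshold:
            flagged[port] = delta
    return flagged


# Hypothetical snapshots taken at two points in time.
before = {"switch1/port3": 10, "switch1/port4": 0}
after = {"switch1/port3": 10, "switch1/port4": 25}
suspect = ports_with_new_symbol_errors(before, after)
```

Here only `switch1/port4` is flagged, which would point staff towards checking that cable. As the "Cons" line notes, a clean result at this level does not rule out problems higher up, for example in GPFS.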

Holistic Approach to Resiliency

HPC-focused list of systems to incentivize robust design solutions


Top500 + 5 best practices = Robust505

Proposed Guidelines

Zero to minimum overhead for making a submission

Metrics (TBD):

Data collection and reporting for Top500 runs

Uptime

Failures classification (known vs. unknown)

Self-healing vs. intervention, i.e. unscheduled maintenance

Known errors database (KEDB) and/or checklist

Faster workaround & resumption of service to users

Knowledge sharing

Ganglia and Nagios integration and completeness (main system, ecosystem, file system)

Best practices from other service providers, e.g. cloud
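Since the metrics are still to be defined, the following is only one possible reading of "uptime", "known vs. unknown" and "self-healing vs. intervention". The `Outage` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    hours: float
    known: bool        # matches an entry in a known errors database (KEDB)
    self_healed: bool  # recovered without staff intervention


def uptime_percent(period_hours: float, outages) -> float:
    """Share of the reporting period during which the system was up."""
    downtime = sum(o.hours for o in outages)
    return 100.0 * (period_hours - downtime) / period_hours


def summarize(outages) -> dict:
    """Count failures by classification and by recovery type."""
    return {
        "known": sum(o.known for o in outages),
        "unknown": sum(not o.known for o in outages),
        "self_healed": sum(o.self_healed for o in outages),
        "intervention": sum(not o.self_healed for o in outages),
    }


# Hypothetical month (720 hours) with two outages.
outages = [Outage(6.0, known=True, self_healed=False),
           Outage(1.2, known=False, self_healed=True)]
report = {"uptime_percent": uptime_percent(720, outages), **summarize(outages)}
```

Keeping the submission to a handful of counts like these is in line with the "zero to minimum overhead" guideline.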


Leveraging Other Approaches & Solutions


https://www.openstack.org



Horizon: dashboard & portal

Nova: cloud compute

Neutron: cloud network

Cinder: block storage

Keystone: identity management

Glance: image management

REST (REpresentational State Transfer) guidelines & best practices for creating scalable web services

Resilience for OpenStack Deployments

High Availability (HA) presentations and papers on OpenStack (mainly at the OpenStack Summits)

Resiliency and Performance Engineering for OpenStack at Enterprise Scale by Mirantis

Checklist ~ regression

Wrecking crew

More Reliable, More Resilient, More Redundant: High Availability Update for Grizzly and Beyond by Hastexo

Layer-by-layer & component-by-component analysis ~ hierarchical approach & customer & service provider model for regression

NovaResiliency

Component specific draft

Ubuntu OpenStackHA guideline

...


"Everything fails, all the time."

Werner Vogels, VP & CTO, Amazon.com

Key Considerations and Next Steps

Commissioning resilient services at a data center

Cultural issues

Overhead or productive work by users & operational staff

Focus on documentation and knowledge sharing

Putting best practices into work, e.g. change management tools

Investments & cost effectiveness

Staffing considerations

Metrics for cost effectiveness and SLAs, to be confirmed by stakeholders

Community driven efforts for HPC data center resiliency

Share your stories (happy endings plus horror)

Feedback on Robust505

Leadership by the HPC Advisory Council, with the network as a focal point




Thank you for your attention.
