Dimensioning Data Centre Resiliency HPC Advisory Council Switzerland Conference 2015 Sadaf Alam, CSCS March 23, 2015
Page 1: Dimensioning Data Centre Resiliency, HPC Advisory Council Switzerland Conference 2015, Sadaf Alam, CSCS, March 23, 2015

Dimensioning Data Centre Resiliency

HPC Advisory Council Switzerland Conference 2015

Sadaf Alam, CSCS

March 23, 2015


Extended Welcome

New and returning attendees, presenters & sponsors

Interconnects for systems interconnected systems

Continuous adaptation of the program to reflect community needs

History of HPC Advisory Council Events, Lugano, Switzerland (source: hpcadvisorycouncil.com)

March 15-17, 2010; March 21-23, 2011; March 13-15, 2012; March 13-15, 2013; March 31 - April 3, 2014

HPC Advisory Council Switzerland 2015 2


Swiss National Supercomputing Centre (CSCS)

Provides, develops, and promotes technical and scientific services for the Swiss research community in HPC (cscs.ch)

Highlights since 2010 (1st Swiss HPC Advisory Council Event)

High Performance High Productivity (HP2C) project

Introduction to hybrid CPU & GPU systems (clusters & Cray XK7)

Move to new building in Lugano from Manno

Launch of data and storage services

Piz Daint: hybrid Cray XC30

All hybrid CPU and GPU systems GPUDirect enabled

Fully non-blocking IB FDR cluster

Extension of Piz Daint with Cray XC40 (Piz Dora) & consolidation of services …


Acknowledgements

Several members of staff at CSCS, specifically:

Network (Chris Gamboni)

Storage (Roberto Aielli, Stefano Gorini)

Regression suite (Tim Robinson, Gabriella Ceci)

User lab (Maria Grazia Giuffreda)

CSCS on call service (Carmelo Ponti)

Review (Nick Cardo)

Please do not hesitate to contact us for details


Provisioning & Consolidation of Services at CSCS

High performance computing services

Piz Daint: hybrid Cray XC30

Piz Dora: Cray XC40 (User Lab, UZH and EPFL MARVEL)

Data analysis and visualization services

Visualization on Piz Daint

Large memory nodes on Piz Dora

Storage and data services

Site-wide GPFS for the User Lab projects

Additional GPFS and offline storage for customers

Services for customers and partners

MeteoSwiss Cray XE6 system

Swiss Tier-2 Infrastructure for LHC community

Fully non-blocking FDR cluster for the PASC community

EPFL Blue Brain 4: IBM Blue Gene/Q & additional resources

Services for the Hilti customers


Resiliency

Resilience according to merriam-webster.com

the ability to become strong, healthy, or successful again after something bad happens

the ability of something to return to its original shape after it has been pulled, stretched, pressed, bent, etc.

Terminology in a dynamic HPC data center environment

Bad = failure or degradation of facility OR hardware OR software

Pulled = cabling incidents, user-induced errors

Stretched = power issues esp. in GPU systems, extensions

Pressed = oversubscription in a shared environment

Bent = repurposed resources, mandatory patches

… = constantly evolving requirements due to service consolidation in a shared environment


Typical Solutions for Resiliency

Redundancy: two or more Piz Daint?

Failover: high availability solutions, for example, MeteoSwiss

Fault tolerance: component vs. vertically integrated stacks

On call services: monitoring and interventions

Service migration: many dependencies


Cost effectiveness of different solutions is a key driving factor


CSCS On Call Setup for Systems with Criticality Levels

Red

Defined by Service Level and Contractual elements

Highest availability requirements

Requirement for an operator

Orange

Central services

Systems with specific production requirements

Moderate coverage during on-call period

Green

No specific requirements for high availability

Mainly R&D systems

No coverage during on-call period
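
The three criticality levels can be written down as a small lookup table. The following is a minimal sketch in Python, not the CSCS implementation: the system names, the `needs_callout` helper and the "full" coverage value for Red are assumptions made for illustration; only the level names and their descriptions come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    RED = "red"
    ORANGE = "orange"
    GREEN = "green"


@dataclass(frozen=True)
class OnCallPolicy:
    description: str
    operator_required: bool
    on_call_coverage: str  # "full", "moderate" or "none"


# Descriptions follow the slide; "full" coverage for RED is an assumption.
POLICIES = {
    Criticality.RED: OnCallPolicy(
        "Defined by service level and contractual elements", True, "full"),
    Criticality.ORANGE: OnCallPolicy(
        "Central services and systems with specific production requirements",
        False, "moderate"),
    Criticality.GREEN: OnCallPolicy(
        "Mainly R&D systems, no specific high availability requirements",
        False, "none"),
}

# Hypothetical systems, used only to exercise the table.
SYSTEMS = {
    "contract-system": Criticality.RED,
    "central-database": Criticality.ORANGE,
    "rnd-cluster": Criticality.GREEN,
}


def needs_callout(system: str) -> bool:
    """Return True if a failure of this system is covered during on-call hours."""
    return POLICIES[SYSTEMS[system]].on_call_coverage != "none"
```

For example, `needs_callout("rnd-cluster")` is false under these assumptions, while the Red and Orange examples would trigger a call-out.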


Dimensions of Data Center Resiliency

Perspective

Users

Operational staff

Stakeholders

Hierarchical building blocks

Holistic approach = perspective + hierarchical building blocks


[Figure: resilient services viewed from three perspectives: users, stakeholders, operational staff]

[Figure: hierarchical building blocks, with an axis labelled "Rate of change"]

Facility & infrastructure

Central services and systems (e.g. network, authentication, central database, etc.)

Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)

Computing systems (e.g. Cray and clusters)

Programming and execution environment

User applications


Resiliency Dimensions @ CSCS (Perspectives)

[Figure: resilient services viewed from three perspectives: users, stakeholders, operational staff]

[Chart: User Lab publications per year, 2010 to 2014; vertical axis from 0 to 300]


HPC Advisory Council & Resiliency—[an unscientific] Survey

2009 (China workshop)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, GPU

2010 (Switzerland, ISC’10 & SC10)

Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI

2011 (Switzerland, China, USA/Stanford, ISC’11, SC11)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU

2012 (Israel, Switzerland, China, Spain, ISC’12)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, security

2013 (USA/Stanford, Switzerland, Spain, China, ISC’13)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data

2014 (USA/Stanford, Switzerland, Brazil, Spain, Singapore, China, South Africa, ISC’14)

Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data

2015 (USA/Stanford, Switzerland …)

Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI, GPU, MIC, Big Data, [Resiliency/Data Center resiliency]


InfiniBand Infrastructure @ CSCS


Network Setup @ CSCS


Site-wide GPFS Storage Configuration @ CSCS


Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA.

[Figure: site-wide GPFS storage configuration. Labels: TSM, VNX8K01, VNX8K02, RAMSAN, FS840, Metadata, Data, Backup, Monch CNFS, General CNFS, GPFS clients, Monch, Various systems]

GPFS can handle node, disk and communication failures*

* Talk to Stefano Gorini for details


Bottom Up vs. Top Down Solutions

Bottom up (customized)

Regression testing tools

Monitors, log file analysis and alerts

Top down

Standards driven

Community effort proposal: Robust505

Holistic approach = bottom up + top down


ANSI/TIA-942 standard

Source: http://www.colocationamerica.com/data-center/tier-standards-overview.htm


CSCS Regression Suite Design Goals

Contract between customers & service providers


[Figure: hierarchical building blocks with an axis labelled "Rate of change". Labels: fine-grain regression tests; customer / service provider (shown twice); CSCS regression suite for Piz Dora & Piz Daint]

Facility & infrastructure

Central services and systems (e.g. network, authentication, central database, etc.)

Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)

Computing systems (e.g. Cray and clusters)

Programming and execution environment

User applications


CSCS Regression Suite Workflow

Regression driver

Admin checks

Machine load, FS, network, GOM, node status, file systems, SLURM status, disk status… TESTID: 1000

Application checks

Basic compilation and library linking + modules env TESTID: 5[0-6][0-6][0-6][0-9]

Compile and run libsci_acc apps TESTID: 600[0-5]

Compile and run basic CUDA tools TESTID: 7[0-7]0[0-3]

Compile CUDA tools on a specified node TESTID: 8[0-7]00

Run scientific applications TESTID: 900[1-4]
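
The TESTID ranges on this slide read like character-class patterns, so a test ID can be mapped back to its group with ordinary regular expressions. The following is a minimal sketch in Python, assuming the IDs are plain digit strings; the `classify` function and the group labels are illustrative names, not part of the CSCS regression suite.

```python
import re

# Patterns copied from the slide; the labels are shortened descriptions.
TEST_ID_PATTERNS = [
    ("system status checks (machine load, file systems, network, SLURM, ...)",
     r"1000"),
    ("basic compilation, library linking and modules env",
     r"5[0-6][0-6][0-6][0-9]"),
    ("compile and run libsci_acc apps", r"600[0-5]"),
    ("compile and run basic CUDA tools", r"7[0-7]0[0-3]"),
    ("compile CUDA tools on a specified node", r"8[0-7]00"),
    ("run scientific applications", r"900[1-4]"),
]


def classify(test_id: str):
    """Return the group label for a test ID, or None if no pattern matches."""
    for label, pattern in TEST_ID_PATTERNS:
        # fullmatch ensures the whole ID matches, not just a prefix.
        if re.fullmatch(pattern, test_id):
            return label
    return None
```

With these patterns, `classify("6003")` falls into the libsci_acc group and `classify("9000")` matches nothing, because the scientific application range starts at 9001.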


Custom options

-m: To run after system downtime (planned or unplanned). Requires a reservation to check all nodes that are up. Runs Admin + App checks.

-p: To run after installation of the monthly PE release and/or any patches obtained from Cray. It tests the development environment.

-c: To run a special suite of tests on the node suspected to be in a bad state.

-a: To run a special suite of production applications to check for performance regressions. Applications are run concurrently, to simulate a production environment. Performance is measured.
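
The four options above could be wired into a command-line driver along these lines. This is a minimal sketch in Python using `argparse`, not the actual CSCS driver: the program name, the assumption that the options are mutually exclusive, the assumption that `-c` takes a node name, and the `select_checks` helper are all hypothetical. Only the flag letters and their purposes come from the slide.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="regression-driver",  # hypothetical name
        description="Sketch of a regression driver front end")
    # Assumption: exactly one mode is chosen per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-m", dest="mode", action="store_const",
                      const="maintenance",
                      help="run after planned or unplanned downtime")
    mode.add_argument("-p", dest="mode", action="store_const",
                      const="patch",
                      help="run after a PE release or patch installation")
    mode.add_argument("-c", dest="node", metavar="NODE",
                      help="run node tests on a suspect node (assumed argument)")
    mode.add_argument("-a", dest="mode", action="store_const",
                      const="applications",
                      help="run production applications concurrently")
    return parser


def select_checks(args: argparse.Namespace) -> list:
    """Translate parsed options into the check groups named on the slide."""
    if args.node is not None:
        return ["node:" + args.node]
    if args.mode == "maintenance":
        return ["admin", "application"]
    if args.mode == "patch":
        return ["development environment"]
    return ["production applications (concurrent)"]
```

For instance, `select_checks(build_parser().parse_args(["-m"]))` yields the admin and application groups, matching the "Runs Admin + App checks" note for `-m`.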


Sample Output (1)

[Screenshot: regression suite output showing "Running admin checks (PASSED)". Callouts: user who submitted the regression; start date; login node where the regression suite was launched; location where the output and the log are stored; end date; test passed]

Page 20: Dimensioning Data Centre Resiliency€¦ · Dimensioning Data Centre Resiliency HPC Advisory Council Switzerland Conference 2015 Sadaf Alam, CSCS March 23, 2015 . Extended Welcome

HPC Advisory Counci Switzerland 2015 20

Sample Output (2)

Possible next steps depend on the types of failed test(s); the decision to resume services depends on the criticality of the failure(s)


LogStash—An Open Source Log Management Tool

(http://logstash.net)


Systems Alerts & Monitoring—homegrown examples

Network snapshot (switches, link info, subnet manager)

Change logs and alerts

Error monitoring, for instance, symbol errors indicate cable issues

Pros: effective and instantaneous detection, and possibly correction, of errors

Cons: level of abstraction, for example, GPFS may have an issue while network logs are clean


Similar & more features likely to be available in the Unified Fabric Manager
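
The error-monitoring idea above (growing symbol error counters point to cable issues) can be illustrated by comparing two snapshots of per-port counters. This is a minimal sketch in Python under stated assumptions: the snapshot format (a dict from a port label to its symbol error count), the port labels and the function name are hypothetical and do not describe the homegrown CSCS tools.

```python
def ports_with_new_symbol_errors(previous, current, threshold=0):
    """Return (port, increase) pairs whose symbol error counter grew.

    Both arguments map a port label to its symbol error counter.
    Ports missing from the previous snapshot are treated as starting at 0.
    A counter that went down (for example after a reset) is not flagged.
    """
    flagged = []
    for port, count in current.items():
        delta = count - previous.get(port, 0)
        if delta > threshold:
            flagged.append((port, delta))
    # Largest increase first, so the most suspect cables are listed on top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken at two points in time.
before = {"switch1/port1": 0, "switch1/port2": 10}
after = {"switch1/port1": 0, "switch1/port2": 25, "switch2/port7": 3}
suspects = ports_with_new_symbol_errors(before, after)
```

As the slide notes, such a check only sees one layer: a clean result here does not rule out a problem higher up, for example in GPFS.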


Holistic Approach to Resiliency

HPC-focused list of systems to incentivize robust design solutions


[Figure labels: Top500; 5 best practices; Robust505]


Proposed Guidelines

Zero to minimum overhead for making a submission

Metrics (TBD):

Data collection and reporting for Top500 runs

Uptime

Failures classification (known vs. unknown)

Self-healing vs. intervention, i.e. unscheduled maintenance

Known errors database (KEDB) and/or checklist

Faster workaround & resumption of service to users

Knowledge sharing

Ganglia and Nagios integration and completeness (main system, ecosystem, file system)

Best practices from other service providers, e.g. cloud
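
Since the metrics are still marked TBD, the following is only a sketch of how uptime and the two failure classifications listed above might be reported together. It is written in Python under stated assumptions: the `Failure` record, its field names and the `summarize` function are hypothetical, and outages are assumed not to overlap in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    downtime_hours: float
    known: bool        # assumed to mean: matches an entry in the KEDB
    self_healed: bool  # recovered without staff intervention


def summarize(period_hours: float, failures: list) -> dict:
    """Report uptime and failure counts for one reporting period."""
    downtime = sum(f.downtime_hours for f in failures)
    return {
        # Fraction of the period in which the system was available.
        "uptime_fraction": (period_hours - downtime) / period_hours,
        "known": sum(1 for f in failures if f.known),
        "unknown": sum(1 for f in failures if not f.known),
        "self_healed": sum(1 for f in failures if f.self_healed),
        "intervention": sum(1 for f in failures if not f.self_healed),
    }


# Hypothetical 30-day period (720 hours) with two incidents.
report = summarize(720.0, [
    Failure(downtime_hours=6.0, known=True, self_healed=False),
    Failure(downtime_hours=1.2, known=False, self_healed=True),
])
```

In this made-up example 7.2 hours of downtime over 720 hours gives an uptime of about 99%.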


Leveraging Other Approaches & Solutions


https://www.openstack.org


Leveraging Other Approaches & Solutions


Horizon: dashboard & portal

Nova: cloud compute

Neutron: cloud network

Cinder: block storage

Keystone: identity management

Glance: image management

REST (REpresentational State Transfer) guidelines & best practices for creating scalable web services


Resilience for OpenStack Deployments

High Availability (HA) presentations and papers on OpenStack (mainly at the OpenStack Summits)

Resiliency and Performance Engineering for OpenStack at Enterprise Scale by Mirantis

Checklist ~ regression

Wrecking crew

More Reliable, More Resilient, More Redundant: High Availability Update for Grizzly and Beyond by Hastexo

Layer-by-layer & component-by-component analysis ~ hierarchical approach & customer & service provider model for regression

NovaResiliency

Component specific draft

Ubuntu OpenStackHA guideline

...


"Everything fails, all the time."

Werner Vogels, VP & CTO, Amazon.com


Key Considerations and Next Steps

Commissioning resilient services at a data center

Cultural issues

Overhead or productive work by users & operational staff

Focus on documentation and knowledge sharing

Putting best practices into work, e.g. change management tools

Investments & cost effectiveness

Staffing considerations

Metrics for cost effectiveness and SLAs, TBC by stakeholders

Community driven efforts for HPC data center resiliency

Share your stories (happy endings plus horror)

Feedback on Robust505

Leadership by HPC Advisory Council with network as a focal point


Acknowledgements

Several members of staff at CSCS, specifically:

Network (Chris Gamboni)

Storage (Roberto Aielli, Stefano Gorini)

Regression suite (Tim Robinson, Gabriella Ceci)

User lab (Maria Grazia Giuffreda)

CSCS on call (Carmelo Ponti)

Review (Nick Cardo)

Please do not hesitate to contact us for details


Thank you for your attention.

