Hobbes: OS and Runtime Support for Application Composition

Ron Brightwell, Coordinating PI
X-Stack/OSR PI Meeting, December 7-8, 2015

Source: xstack.sandia.gov/hobbes/files/Hobbes-Dec8.pdf

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Hobbes Team

Institution                            Person               Role
Georgia Institute of Technology        Ada Gavrilovska      PI
Indiana University                     Thomas Sterling      PI
Los Alamos National Lab                Mike Lang            PI
Lawrence Berkeley National Lab         Costin Iancu         PI
North Carolina State University        Frank Mueller        PI
Northwestern University                Peter Dinda          PI
Oak Ridge National Laboratory          David Bernholdt      PI
Oak Ridge National Laboratory          Arthur B. Maccabe    Chief Scientist
Sandia National Laboratories           Ron Brightwell       Coordinating PI
University of Arizona                  David Lowenthal      PI
University of California – Berkeley    Eric Brewer          PI
University of New Mexico               Patrick Bridges      PI
University of Pittsburgh               Jack Lange           PI

Project Goals

§ Deliver prototype OS/R environment for R&D in extreme-scale scientific computing
§ Focus on application composition as a fundamental driver
  § Develop necessary OS/R interfaces and system services required to support resource isolation and sharing
  § Support complex simulation and analysis workflows
§ Provide a lightweight OS/R environment with flexibility to build custom runtimes
  § Compose applications from a collection of enclaves
§ Leverage Kitten lightweight kernel and Palacios lightweight virtual machine monitor
  § Enable high-risk high-impact research in virtualization, energy/power, scheduling, and resilience

Kitten Lightweight Kernel

§ Monolithic, C code, GNU toolchain, Kbuild configuration
§ Supports x86-64 and ARM-64
§ Boots on standard PC architecture, Cray XT, and in virtual machines
  § Boots identically to Linux (Kitten bzImage and init_task)
§ Repurposes basic functionality from Linux
  § Hardware bootstrap
  § Basic OS kernel primitives (lists, locks, wait queues, etc.)
  § Directory structure similar to Linux, arch dependent/independent dirs
§ Custom address space management and task management
  § User-level API for managing physical memory, building virtual address spaces
  § User-level API for creating tasks, which run in virtual address spaces
§ Small, highly reliable code base
§ Focused on scalable HPC applications
  § Low noise
  § Small memory footprint
§ Open source and freely available

Palacios Virtual Machine Monitor

§ OS-independent embeddable virtual machine monitor
  § Can be combined with Kitten or Linux
§ Full system virtualization
  § No need to modify guest OS
§ Supports running multiple guests concurrently
§ Makes extensive use of virtualization extensions in modern Intel and AMD x86 processors
§ Passthrough resource partitioning
§ Extensive configurability
§ Low noise
§ Open source and freely available
§ Small, highly reliable code base

Systems Are Converging to Reduce Data Movement

§ External parallel file system is being subsumed
  § Near-term capability systems using NVRAM-based burst buffer
  § Future extreme-scale systems will continue to exploit persistent memory technologies
§ In-situ and in-transit approaches for visualization and analysis
  § Can’t afford to move data to separate systems for processing
  § GPUs and many-core processors are ideal for visualization and some analysis functions
§ Less differentiation between advanced technology and commodity technology systems
  § On-chip integration of processing, memory, and network
  § Summit/Sierra using InfiniBand

[Figure labels: Exascale System, Capability System, Analytics Cluster, Parallel File System, Visualization Cluster, Capacity Cluster]

Applications and Usage Models are Diverging

§ Application composition becoming more important
  § Ensemble calculations for uncertainty quantification
  § Multi-{material, physics, scale} simulations
  § In-situ analysis and graph analytics
  § Performance and correctness analysis tools
§ Applications may be composed of multiple programming models
§ More complex workflows are driving need for advanced OS services and capability
  § “Workflow” has overtaken “Co-Design” as most popular DOE buzzword :-)
§ Desire to support “Big Data” applications
  § Significant software stack comes along with this
§ Support for more interactive workloads
§ Requirements are independent of programming model and hardware

Multiphysics Example

Multiphysics Example (cont’d)

Multiphysics Example (concl’d)

ENCLAVE COMPOSITION

A Deeper Look at Composition

Intra-Node Composition
§ Components co-located on same set of nodes
§ Isolate NOS environments on each node
§ Composition (coupling) takes place via shared memory

Inter-Node Composition
§ Components deployed to separate sets of nodes
§ Composition (coupling) takes place via network

Composition of Enclaves

§ An enclave provides a single OS/R environment to the application
§ Hobbes approach is to provide the minimum “amount” of OS/R required by the application (do what is necessary and get out of the way!)
§ Modern, complex applications are increasingly created by assembling (often substantial) software components
  § E.g., analytics connected to applications, code coupling, application frameworks, …
§ Components may have distinct requirements for OS/R support
§ Two options:
  § Assemble an all-in-one OS/R stack that satisfies all component needs
    § Potential challenges at both OS and RTS levels
    § Requires integration work for every combination supported
  § Provide each component the OS/R it needs, and provide efficient, low-level mechanisms to connect the components (and the OS/Rs)

Composition Examples

§ SNAP + Analytics
  § “SNAP calculates synonymous and non-synonymous substitution rates based on a set of codon-aligned nucleotide sequences.” (HIV related)
  § Proxy app from LANL used for example
§ GTC-P + Analytics
  § Fusion simulation testing/proxy app used to test new hardware and algorithm integration into the PIC model. (PPPL)
  § Analytics generate statistics on particles (histograms), sorts, and filters on bounding boxes
§ LAMMPS + Analytics
  § Full, production molecular dynamics application from Sandia
  § Analytics look for crack formation by calculating atomic spacing in output data to change simulation from coarse to fine grained.

Composition Approach

§ ADIOS or TCASM as user-facing data exchange API
§ XEMEM (cross-enclave shared memory) as low-level transport
  § XEMEM extends XPMEM with global name service
  § Cross-enclave signaling under development
§ Kitten lightweight kernel
§ Palacios high-performance virtual machine monitor
§ Pisces co-kernel architecture
§ Composition demonstrated with different OS combinations
  § Cray Compute Node Linux (CNL) and stock Linux
  § CNL and Kitten
  § Kitten and Kitten

[Figure labels: Simulation, Analytics, Application, Hobbes Runtime, ADIOS, TCASM, XEMEM, Operating System, Kitten Co-Kernel (Pisces), Linux, Hardware]
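The pairing of a shared-memory transport with a name service can be illustrated with a minimal sketch. It is not the XEMEM API: it uses Python's standard `multiprocessing.shared_memory` module inside one OS, and the `NameService` class, the `export_region` and `attach_region` helpers, and the `"lammps/atoms"` name are all hypothetical. The point is only the pattern: a producer exports a region under a well-known name, and a consumer discovers and attaches to it by that name rather than by a raw segment identifier.

```python
from multiprocessing import shared_memory


class NameService:
    """Hypothetical registry mapping well-known names to segment ids."""

    def __init__(self):
        self._segments = {}

    def publish(self, name, segment):
        if name in self._segments:
            raise KeyError(f"name already published: {name}")
        self._segments[name] = segment

    def lookup(self, name):
        return self._segments[name]


def export_region(names, name, payload):
    """Producer side: copy payload into a new segment and publish it."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    # The length is published too, because the OS may round the size up.
    names.publish(name, (shm.name, len(payload)))
    return shm


def attach_region(names, name):
    """Consumer side: resolve the name, then attach to the segment."""
    segment_id, length = names.lookup(name)
    return shared_memory.SharedMemory(name=segment_id), length


def demo():
    names = NameService()
    producer = export_region(names, "lammps/atoms", b"atom positions")
    consumer, length = attach_region(names, "lammps/atoms")
    data = bytes(consumer.buf[:length])   # copy out before detaching
    consumer.close()
    producer.close()
    producer.unlink()                     # release the segment
    return data
```

In a real multi-enclave setting the registry would have to be reachable from every enclave on the node, which is the role the slide assigns to the global name service.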

Composition Working Group Current Focus Areas

§ Composition use cases
  § Identifying use case categories and specific examples
  § Analyzing coupling requirements of specific use cases
  § Identifying abstractions necessary for user- and system-level composition
§ “Hobbes composition language”
  § How to describe a composite application and its mapping onto system resources (“job control language” for composite application)
  § Gathering and understanding “related work”
    § E.g. workflow languages, cloud configuration/deployment tools, CCA, ADIOS & TCASM APIs, constraint programming languages, …

Composition Working Group Goals

§ Working towards three basic API specifications…
§ Application-level composition API
  § Basic user-level abstractions for cross-enclave data sharing
  § May be several distinct (sets of) abstractions
  § Not necessarily implemented by/in Hobbes (e.g. ADIOS, TCASM, …)
§ System-level composition API
  § Used to write cross-enclave “transports” to be used by application-level composition libraries
  § Cross-enclave shared memory (example), cross-enclave signaling, node-level name service
§ Node-level enclave management API
  § Set and query configuration of enclaves on node
  § Resource allocation, VMM & OS, …
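To make "set and query configuration of enclaves on node" concrete, here is a minimal sketch of what such a node-level table might track. It is an illustration under stated assumptions, not the Hobbes or Leviathan interface: the `Enclave` and `NodeManager` names, their fields, and the rule that cores and memory are assigned exclusively are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enclave:
    name: str
    os: str            # e.g. "Linux" or "Kitten"
    vmm: str           # e.g. "Palacios", or "none" when running natively
    cores: frozenset   # core ids owned exclusively by this enclave
    mem_mb: int


class NodeManager:
    """Hypothetical node-level enclave table with exclusive partitioning."""

    def __init__(self, num_cores, mem_mb):
        self.free_cores = set(range(num_cores))
        self.free_mem_mb = mem_mb
        self.enclaves = {}

    def create(self, name, os, cores, mem_mb, vmm="none"):
        cores = frozenset(cores)
        if name in self.enclaves:
            raise ValueError(f"enclave exists: {name}")
        if not cores <= self.free_cores:
            raise ValueError("cores already assigned or not present")
        if mem_mb > self.free_mem_mb:
            raise ValueError("not enough free memory")
        self.free_cores -= cores
        self.free_mem_mb -= mem_mb
        self.enclaves[name] = Enclave(name, os, vmm, cores, mem_mb)
        return self.enclaves[name]

    def query(self, name):
        return self.enclaves[name]

    def destroy(self, name):
        enclave = self.enclaves.pop(name)
        self.free_cores |= enclave.cores      # return resources to the pool
        self.free_mem_mb += enclave.mem_mb
```

For example, `NodeManager(8, 16384)` could hand six cores to a Kitten enclave for the simulation and the remaining two to a Linux enclave for analytics; a third request for an already assigned core would be rejected.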

Enabling Multi-OS/R Stack Application Composition

• Problem
  • HPC applications evolving to more compositional approach, overall application is a composition of coupled simulation, analysis, and tool components
  • Each component may have different OS/R requirements, no “one-size-fits-all” OS/R stack
• Solution
  • Partition node-level resources into “enclaves”, run different OS/R instance in each enclave
    Pisces Co-kernel Architecture: http://www.prognosticlab.org/pisces/
  • Provide tools for creating and managing enclaves, launching applications into enclaves
    Leviathan Node Manager: http://www.prognosticlab.org/leviathan/
  • Provide mechanisms for cross-enclave application composition and synchronization
    XEMEM Shared Memory: http://www.prognosticlab.org/xemem/
• Recent results
  • Demonstrated Multi-OS/R approach provides excellent performance isolation; better than native performance possible
  • Demonstrated drop-in compatibility with both commodity and Cray Linux environments
• Impact
  • Application components with differing OS/R requirements can be composed together efficiently within a compute node, minimizing off-node data movement
  • Compatible with unmodified vendor provided OS/R environments, simplifies deployment

[Figure: In-situ Simulation + Analytics composition in single Linux OS vs. Multiple Enclaves]

System-Level Support for Composition of Applications

• Problem
  – HPC applications are increasingly composed of multiple distinct components with different requirements for OS, software stack and system resources
  – E.g., simulation+analytics, coupled multiphysics, scalable performance analysis and debugging
• Solution
  – Instantiate “enclaves” for each application component using high-performance virtualization technology
  – Provide OS and software stack tailored for application component within each enclave
  – Provide mechanisms for controlled interaction between enclaves (components)
    – Selective sharing of memory regions (data exchange)
    – Name service (discovery and rendezvous)
• Recent results
  – Proof-of-principle for XEMEM cross-enclave memory API
  – Use XEMEM as “transport” in ADIOS, TCASM coupling tools
  – Demonstrate composite simulation+analytics applications using XEMEM
• Impact
  – Composition can be made transparent at the application level (no changes, performance neutral)
  – Allows detailed resource management and scheduling among enclaves (other Hobbes R&D areas)

NODE VIRTUALIZATION LAYER

Recent NVL Accomplishments

§ Demonstrated Node Virtualization Layer (NVL) Functionality
  § Improved inter-enclave isolation (HPDC’15)
  § High-performance inter-enclave memory sharing (HPDC’15)
  § Cross-enclave code coupling (ROSS’15)
§ Hybrid VMM Basic Functionality
  § Booting on all cores of Knights Corner Phi (228 cores)
  § Demonstrated Hybrid VMM “boot” faster than native Linux fork/exec
§ NVL Next Steps in Development
  § Libhobbes – library combining NVL client functionality under “one roof”
  § Multi-node support: NVL driver for Cray Gemini network
  § Integration of TCASM with XPMEM to enable another form of cross-enclave code coupling

NVL Architecture

§ Co-kernel Architecture: Multiple OS kernels run side-by-side on same node in different enclaves
§ Pisces infrastructure used to launch and manage enclaves and bind enclaves together
§ XEMEM mechanism developed to enable cross-enclave memory sharing

[Figure 1: Co-Kernel Architecture, Three Enclave Example. Labels: Hardware, Linux, Applications + Virtual Machines, Palacios VMM, Kitten Co-kernel (1), Kitten Co-kernel (2), Isolated Virtual Machine, Isolated Application, Pisces]

[Figure 2: Cross-enclave communication used for enclave control and for cross-enclave app code coupling. Labels: Hardware Partition, User Context, Kernel Context, Linux, Kitten Co-Kernel, Control Process, Cross-Kernel Messages, Shared Mem Ctrl Channel, Shared Mem Communication Channels, Linux Compatible Workloads, Isolated Processes + Virtual Machines]

NVL Provides Excellent Inter-Enclave Isolation

[Figure panels: Native Linux, Single Linux Image (top); Kitten Enclave with Same Competing Workload (bottom)]

§ Co-kernel Architecture nearly eliminates OS-induced interference
§ When using single Linux OS (top), competing workload induces noise on other processes, even when they are pinned to disjoint cores and memory
§ Isolating processes in a separate Kitten enclave (bottom) eliminates this interference

Excellent Isolation of NVL/Co-Kernel Arch Leads to Increased Performance and Reduced Run-to-run Variability

[Figure: Comparison of Mini-app/Benchmark Performance With and Without a Competing Background Workload (Kernel Compile)]

Hybrid VMM: Overview

§ Terminology
  § Hybrid Run-Time (HRT) is a run-time + application that run entirely at kernel-level
  § Regular Operating System (ROS) is full blown OS stack (e.g., Linux)
  § Hybrid Virtual Machine (HVM) allows a single VM to contain both a ROS and an HRT simultaneously, giving each a distinct view of (and access to) the resources, as well as distinct interfaces to the VMM.
§ We have developed the core framework for enabling HRTs (with and without HVM) as well as an initial implementation of the HVM

Hybrid VMM: Model

[Figure, three panels: (a) Current Model; (b) Hybrid Run-time Model; (c) Hybrid Run-time Model Within a Hybrid Virtual Machine. Labels: Parallel App, Parallel Run-time, General Kernel, Hybrid Run-time (HRT), Hybrid Virtual Machine (HVM), User Mode, Kernel Mode, Node HW, Specialized Virtualization Model, General Virtualization Model, Performance Path, Legacy Path]

Hybrid VMM: Nautilus Kernel Framework

§ A new kernel developed specifically to enable creation and porting of HRTs
  § Kyle Hale’s thesis work
§ Currently running on:
  § x64 bare metal (64 cores)
  § Intel Xeon Phi (3120A, 228 cores)
  § Palacios HVM
§ With ports of:
  § Legion run-time and circuit simulation application
  § NESL run-time (VCODE engine)
  § NDPC (home-grown nested data parallel language)
§ ~24K SLOC (C, C++, Assembly)
  § <1K SLOC needed to enable RT ports

Hybrid VMM: Nautilus Initial Performance (x64)

[Figure 4, six panels: (a)-(c) plot Cycles against Threads (2 to 64) for Nautilus and Linux; (d)-(f) plot Speedup against Threads (2 to 64)]

Figure 4: Average (a), minimum (b), and maximum (c) time to create a number of threads in sequence. Average (d), minimum (e), and maximum (f) speedup of Nautilus over Linux for multiple thread creations.

[...] single nodes as core counts continue to scale up. The introduction of variance by OS noise (not just by asynchronous paging events) not only limits the performance and predictability of existing run-times, but also limits the kinds of run-times that can take advantage of the machine. For example, run-times that need tasks to execute in synchrony (e.g., in order to support a bulk-synchronous parallel application or a run-time that uses an abstract vector model) will experience serious degradation if OS noise comes into play.

The use of a single unified address space also allows very fast communication between threads, and eliminates much of the overhead of context switches when Nautilus boots with preemption enabled. The only preemption is between kernel threads, so no page table switch ever occurs. This is especially useful when Nautilus runs virtualized, as a large portion of VM exits come from paging related faults and dynamic mappings initiated by the OS, particularly using shadow paging. A shadow-paged Nautilus exhibits the minimum possible shadow page faults, and shadow paging can be more efficient than nested paging, except when shadow page faults are common.

Events. Events are a common abstraction that run-time systems often use to distribute work to execution units, or workers. The Legion run-time makes heavy use of them, so we wanted to make sure that Nautilus provided an efficient implementation of them. In Legion, the events are used to notify logical processors (Legion threads) when there are tasks ready to execute. To help show the potential of Legion + Nautilus as an HRT, we measured the performance of these “wakeup” events.

Figure 6 shows the average latency between an event notification and the subsequent wakeup. Here, we had a single thread on one core go to sleep and wait for an event notification from a thread running on the adjacent physical core. The latency is measured in cycles and the average is taken over 100 runs. The first box on the left shows the latency of a common mechanism used in Linux for event notification, the pthread implementation of condition variables. In this case, the measurement is the time between calling pthread_cond_signal and the subsequent wakeup from pthread_cond_wait. A wakeup takes about 25000 cycles. The following three boxes show various Nautilus implementations of event notification. “N. MWAIT” shows the latency when using the newer MONITOR/MWAIT extensions provided by modern processors. These instructions allow one thread to go to sleep on a range of memory, waiting for a write to that memory by another thread. Note that the MONITOR/MWAIT instructions are not available in user-space. While the latency improves considerably over pthread’s condition variables in Linux user-space, we suspect it is limited by the hardware latency incurred when waking up from a sleep state. The MONITOR/MWAIT extensions provide optional hints to enter lower sleep states and allow faster wakeups, but our machine does not support this feature.

The final two boxes (“N. condvar” and “N. w/kick”) show the Nautilus implementation of condition variables. They are very lightweight, and a signal will essentially just enqueue the waiting thread on a processor’s run queue. [...]

[Figure 6 bar chart: Cycles (0 to 30000) for Linux, N. MWAIT, N. condvar, and N. w/kick; annotations “not available in userspace” and “overhead too high in userspace”]

Figure 6: Average event wakeup latency.

§ Microbenchmarks
  § Thread creation 11x faster than Linux/pthreads
    § Including thread fork capability
  § Event wakeup 5x faster than Linux/pthreads (right)
§ Macrobenchmarks
  § Legion circuit simulation benchmark already 5% faster at 64 cores
§ Both due to leveraging kernel-mode only features and avoiding all kernel/user transitions

[Figure title: Latency of Waking up a Thread on a Remote Core]
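The event wakeup measurement described above (time from the signal call to the waiter resuming, averaged over repeated runs) can be sketched in user space as follows. This is only an illustration of the measurement pattern: it uses Python's `threading.Condition` rather than pthreads or Nautilus, reports nanoseconds from `time.perf_counter_ns` rather than cycles, does not pin threads to cores, and its numbers are dominated by interpreter overhead. The function name `measure_wakeup_ns` is hypothetical.

```python
import threading
import time


def measure_wakeup_ns(runs=100):
    """Average signal-to-wakeup latency in nanoseconds over `runs` trials."""
    total = 0
    for _ in range(runs):
        cond = threading.Condition()
        waiting = threading.Event()
        state = {"signaled": False, "t_signal": 0, "t_wake": 0}

        def waiter():
            with cond:
                waiting.set()                  # the lock is still held here
                while not state["signaled"]:   # guard against spurious wakeups
                    cond.wait()                # releases the lock while asleep
                state["t_wake"] = time.perf_counter_ns()

        thread = threading.Thread(target=waiter)
        thread.start()
        waiting.wait()
        # This acquire succeeds only after the waiter has released the lock
        # inside cond.wait(), so the waiter is asleep when we signal.
        with cond:
            state["signaled"] = True
            state["t_signal"] = time.perf_counter_ns()
            cond.notify()
        thread.join()
        total += state["t_wake"] - state["t_signal"]
    return total / runs
```

Calling `measure_wakeup_ns()` returns one averaged figure per invocation; comparing it with the cycle counts in Figure 6 would not be meaningful, since the mechanisms and units differ.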

Hybrid VMM: HVM in Palacios

§ Partitions VM’s cores into ROS cores and HRT cores
§ HRT core view
  § Specialized fast bootstrap of HRT ELF (in microseconds)
  § No BIOS, etc.
  § Immediate startup in preconfigured long mode environment (co-designed with Nautilus)
  § Extended Multiboot2 model
  § Environment leads to few VM exits during typical bootup/execution
  § Possible prebuilt VM structures (e.g. page tables) and selective direct hardware access further reduce VM exits
  § Not a target of typical interrupts, etc.
  § All memory of VM accessible
  § All APICs of VM accessible
  § Separately rebootable from ROS cores
§ ROS core view
  § Traditional VM’s view (BIOS boot, ACPI, devices, etc), just smaller
  § Not all memory of VM accessible/visible (HRT can hide memory)
  § Not all APICs accessible (HRT cores’ APICs cannot be targeted)
§ ~2K SLOC additions to Palacios (so far)

Hybrid VMM: HVM in Palacios

§ Initial bootstrap timings show HRT core reboot time is similar to fork() and exec()

Item                                        Cycles (exits)        Time (AMD 4122)
HRT core boot of Nautilus to main()         ~135K (7 exits)       61 µs
Linux fork()                                ~320K                 145 µs
Linux exec()                                ~1M                   476 µs
Linux fork()+exec()                         ~1.5M                 714 µs
HRT core boot of Nautilus to idle thread    ~37M (~2300 exits)    17 ms
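The cycle and time columns can be cross-checked with a short calculation. The sketch below assumes a clock rate of about 2.2 GHz; that figure is not stated on the slide and is inferred from the first row (about 135K cycles in 61 µs). The names `CLOCK_HZ`, `cycles_to_us`, and `us_to_cycles` are illustrative.

```python
CLOCK_HZ = 2.2e9  # assumption: inferred from ~135K cycles taking 61 µs


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


def us_to_cycles(microseconds, clock_hz=CLOCK_HZ):
    """Convert microseconds back to a cycle count."""
    return microseconds * 1e-6 * clock_hz


boot_to_main_us = cycles_to_us(135_000)            # about 61 µs
fork_us = cycles_to_us(320_000)                    # about 145 µs
boot_to_idle_ms = cycles_to_us(37_000_000) / 1000  # about 17 ms
exec_cycles = us_to_cycles(476)                    # about 1.05M, shown as ~1M
```

Under this assumption the rounded cycle counts and the reported times agree, and the boot to main() is roughly 714 / 61, or about 12 times, shorter than fork() plus exec() on the same machine.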

Legion on Kitten

§ Successfully integrated a shared-memory version of the Legion runtime into the Kitten environment with minimal effort.
§ Successfully ran LGNCG, a Legion port of the HPCG benchmark code, on Kitten (single node).
§ Distributed LGNCG with Legion on Kitten environment to Peter Dinda for his Legion investigation.

GLOBAL INFORMATION BUS

GIB Activities

§ GIB “design-a-thon” meeting 3/18/15, hosted by Matthew Wolf at GaTech
  § Representation from both ARGO and Hobbes
  § Status updates from project participants, including:
    § BEACON over EVPath
    § LAMMPS composition use case from Hobbes ROSS paper
  § Discussion of potential GIB data store foundations
    § Sandia’s Kelpie, several open source packages, and Proactive Data Store (PDS)/Drift
    § Reports about preliminary experience and experimentation with some open source packages, e.g., Redis and mongoDB
    § Decided to pursue design using PDS as foundation
  § Detailed discussion of GIB use cases
    § Boot
    § System monitoring
    § Composed application (launch, normal operation, response to failures/faults, graceful shutdown)

GIB Activities (II)

§ Post-meeting actions
  § Created repositories for meeting documents, data store source code/documentation
  § [ongoing] Converting whiteboard snapshots describing use case discussions into electronic documents for dissemination and review
§ ARGO/Hobbes telecon 4/13/15
  § Outbrief for people who weren’t in attendance at GaTech meeting
  § More discussion regarding data store interface and implementation
    § Suggestion to consider Riak, leading to some exploration with Riak open source version
    § Questions about PDS and its integration into EVPath
§ Ongoing activities
  § Design and preliminary implementation of data store
    § Exploration of using Riak
    § Exploration of integrating PDS with BEACON
  § Development of GIB use case document

Page 35: Hobbes:’’ OS’and’Run/me’Supportfor’ Applicaon’Composi/on’’xstack.sandia.gov/hobbes/files/Hobbes-Dec8.pdf · Sandia National Laboratories is a multi-program laboratory

Support for extreme-scale OS/R monitoring and control

•  Problem
  –  Operating system/runtime (OS/R) components running throughout the system must be monitored and controlled, but extreme system scale makes it difficult to do so (too much data, and/or too many "hops" to get data from one part of the system to another)
•  Solution
  –  Integrate scalable, distributed data store with publish and subscribe service in a Global Information Bus (GIB)
  –  Interface with Hobbes Leviathan node-level resource manager
•  Recent progress
  –  Defined important GIB use cases
    –  System boot
    –  Launch application
    –  Respond to application failure
    –  Respond to application termination
  –  Designed and began pilot implementation of integration of distributed data store based on Riak open source database, BEACON publish-subscribe software from ARGO project, and Leviathan
•  Impact
  –  Supports monitoring and control of a large number of system software components without excessive application intrusion
  –  Usable by both Hobbes and ARGO projects

[Figure: GIB data store and publish/subscribe components (Data Store, Leviathan, Application(s), BEACON). Dashed lines indicate potential notifications from publishers to subscribers.]
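The combination of a data store with a publish and subscribe service can be illustrated with a minimal sketch. This is a toy, single-process stand-in and not the Riak, BEACON, or Leviathan interface: the class `ToyInformationBus`, its methods, the topic name, and the node and payload values are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional

Callback = Callable[[str, Any], None]


class ToyInformationBus:
    """Hypothetical in-process stand-in for a data store plus publish/subscribe."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, Any]] = defaultdict(dict)
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        """Register a callback to be notified of every update under a topic."""
        self._subscribers[topic].append(callback)

    def put(self, topic: str, key: str, value: Any) -> None:
        """Store the value, then notify the topic's subscribers."""
        self._store[topic][key] = value
        for callback in self._subscribers[topic]:
            callback(key, value)

    def get(self, topic: str, key: str) -> Optional[Any]:
        """Return the last stored value, or None if nothing was stored."""
        return self._store[topic].get(key)


# Use case from the slide: respond to an application failure.
# The topic, node name, and payload are made-up illustration values.
bus = ToyInformationBus()
notifications = []
bus.subscribe("application_failure", lambda key, value: notifications.append((key, value)))
bus.put("application_failure", "node-17", {"app": "demo", "exit_code": 139})
```

Storing the value before notifying means a subscriber that reacts to the event can read the same state back from the store, which is one plausible reason to pair the two services.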


RESILIENCE  


Hobbes Resilience Effort

§ OS/R data structure resilience
  § Hashes, redundant data, self checks, and self repairs
  § HR/HI component: Rejuvenation/migration
§ OS/R resilience building blocks (reuse existing solutions)
  § Membership management protocols with different consistency
  § Persistent state management (resilient distributed key/value stores)
  § Reliable/unreliable publish/subscribe event notification APIs
§ Tunable resilience and cross-layer/-enclave coordination
  § Cost model of different cross-layer/-enclave resilience choices
  § Inter-enclave workflow coordination that annotates data
  § HR/HI component: Interfaces for autonomic management
§ OS/R support for fault sensitivity and coverage analysis
  § HR/HI component: Support for fault injection and controlled experiments
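As a rough illustration of "hashes, redundant data, self checks, and self repairs", the sketch below protects a record with a checksum and one replica. It is not Hobbes code: `ProtectedRecord`, its methods, and the choice of SHA-256 are assumptions made for the example, and an in-kernel implementation would likely use a much cheaper check.

```python
import hashlib
from dataclasses import dataclass


def digest(data: bytes) -> bytes:
    """Hash used for the self check (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(data).digest()


@dataclass
class ProtectedRecord:
    """Hypothetical record carrying a redundant copy and a checksum."""
    primary: bytearray
    replica: bytearray
    checksum: bytes

    @classmethod
    def create(cls, data: bytes) -> "ProtectedRecord":
        return cls(bytearray(data), bytearray(data), digest(data))

    def self_check(self) -> bool:
        """True if the primary copy still matches the stored checksum."""
        return digest(bytes(self.primary)) == self.checksum

    def self_repair(self) -> bool:
        """Restore the primary from the replica if the replica is intact.

        Returns True when the primary is valid afterwards, False when both
        copies fail the check and the record cannot be repaired.
        """
        if self.self_check():
            return True
        if digest(bytes(self.replica)) == self.checksum:
            self.primary = bytearray(self.replica)
            return True
        return False


record = ProtectedRecord.create(b"run queue state")
record.primary[0] ^= 0xFF          # simulate a corrupted byte
detected = not record.self_check()
repaired = record.self_repair()
```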


Hobbes Resilience Update

§ OS/R data structure resilience
  § Tracking memory allocations within the Linux OS to record supplemental location information on generic allocations, including the requesting module and source file/line
  § Protecting a basic slab memory allocator from corruption using data corruption detection and correction based on already existing pointer redundancy
  § Transparently storing resilience metadata alongside dynamically allocated OS data to provide data structure resilience on an as-needed basis
§ OS/R support for fault sensitivity and coverage analysis
  § Review of the current status of the Linux fault injection support revealed great potential for extension to permit injection of errors in specific Hobbes OS subsystems
  § Initial work focused on injecting fatal failures in the guest OS to identify the isolation capabilities of the virtualization environments KVM, QEMU and Palacios
§ Testbed system for Hobbes resilience and other Hobbes efforts
  § Deployed Kitten and Palacios (with Linux as host OS) on bare hardware and on QEMU on a 960-core computer science research cluster at ORNL
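One way to picture "detection and correction based on already existing pointer redundancy" is a circular doubly linked list, where each forward link is implied by a backward link. The sketch below is an assumption about the general idea and not the slab allocator work itself: it models the list with index arrays, assumes the `prev` links are intact, and uses hypothetical function names.

```python
from typing import List, Tuple


def build_ring(n: int) -> Tuple[List[int], List[int]]:
    """Circular doubly linked list of n nodes, stored as next/prev index arrays."""
    nxt = [(i + 1) % n for i in range(n)]
    prv = [(i - 1) % n for i in range(n)]
    return nxt, prv


def find_bad_next(nxt: List[int], prv: List[int]) -> List[int]:
    """Detect nodes whose next link is out of range or not mirrored by prev."""
    n = len(nxt)
    return [i for i in range(n) if not (0 <= nxt[i] < n and prv[nxt[i]] == i)]


def repair_next(nxt: List[int], prv: List[int]) -> int:
    """Rebuild corrupted next links from the prev links (assumed intact).

    Returns the number of links repaired.
    """
    repaired = 0
    for i in find_bad_next(nxt, prv):
        # The true successor of i is the unique node whose prev link names i.
        candidates = [j for j in range(len(prv)) if prv[j] == i]
        if len(candidates) == 1:
            nxt[i] = candidates[0]
            repaired += 1
    return repaired


nxt, prv = build_ring(5)
nxt[2] = 0                      # simulate a corrupted forward pointer
bad = find_bad_next(nxt, prv)
fixed = repair_next(nxt, prv)
```

No extra storage is needed for the check, since the backward links already exist for list traversal; the cost is that a fault hitting both directions of the same link cannot be corrected this way.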


mini-ckpts: Surviving OS Failures in Persistent Memory

•  Problem
  –  A failure of the operating system (OS) causes a failure of an otherwise healthy HPC application
•  Solution
  –  Execute the application in persistent memory (PRAMFS in DRAM) that is able to survive OS failures and reboots
  –  Track OS state used by the application and MPI for recovery
  –  Rejuvenate (warm reboot) the OS in case of a failure
  –  Restore tracked OS state used by the application and MPI
  –  Transparently continue to execute the application in persistent memory without loss of state/progress
•  Recent results
  –  Prototype implementation supports OpenMP and MPI applications with certain limitations
  –  OS rejuvenation and recovery takes 3-6 seconds
  –  Failure-free runtime overhead is 3-5% for a number of key HPC workloads
•  Impact
  –  First solution that transparently offers OS failure tolerance without loss of state/progress
  –  Transparently handling OS failures locally reduces the need for global checkpoint/restart
  –  Latent OS errors that have not resulted in a failure can be cleared by rejuvenating the OS

[Figure: Runtime in seconds for NPB CG, NPB LU, NPB EP, NPB IS, PENNANT, and CloverLeaf under Open MPI, librlmpi, librlmpi (with PRAMFS remap), and librlmpi (mini-ckpts & PRAMFS).]

[Figure: Additional runtime in seconds versus number of kernel panic injections (0 to 4), for same-target and alternating-target injections with CG, IS, and LU.]
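The two reported figures (3-5% failure-free overhead, 3-6 seconds per rejuvenation and recovery) can be combined into a back-of-the-envelope estimate. The linear model below is an assumption made for illustration, not a result from the slides, and the 100-second baseline and two failures are hypothetical inputs; measured runtimes may differ.

```python
def estimated_runtime(base_seconds: float, overhead_fraction: float,
                      os_failures: int, recovery_seconds: float) -> float:
    """First-order model: failure-free overhead plus a fixed cost per OS failure.

    Assumes no progress is lost on a failure, so each failure only adds
    the rejuvenation and recovery time.
    """
    return base_seconds * (1.0 + overhead_fraction) + os_failures * recovery_seconds


# Hypothetical job: 100 s baseline with 2 OS failures (both values are assumptions).
best_case = estimated_runtime(100.0, 0.03, 2, 3.0)   # 3% overhead, 3 s per recovery
worst_case = estimated_runtime(100.0, 0.05, 2, 6.0)  # 5% overhead, 6 s per recovery
print(f"{best_case:.1f} s to {worst_case:.1f} s")    # prints "109.0 s to 117.0 s"
```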


http://xstack.sandia.gov/hobbes

