
Alok Choudhary, Northwestern University · I/O Software and Data

I/O Software and Data, Alok Choudhary, Northwestern University. Large Scale Computing and Storage Requirements for Advanced Scientific Computing Research, ASCR / NERSC Workshop, January 5-6, 2011.
Transcript
Page 1

I/O Software and Data

Alok Choudhary
Northwestern University

Large Scale Computing and Storage Requirements for Advanced Scientific Computing Research, ASCR / NERSC Workshop, January 5-6, 2011

Page 2

I/O software stack in HPC

[Stack diagram, top to bottom: Applications → Data model based I/O libraries → High-level parallel I/O libraries → MPI-IO → POSIX or file system API (client-side file system) → Server-side file system → Storage system]

• High-level parallel I/O libraries
  – Portable file format, self-describing, metadata-rich
  – Examples: PnetCDF, netCDF4, HDF5

• Data model based I/O libraries
  – Domain-specific
  – Encapsulate multiple lower-level I/O interfaces (e.g. HDF5, netCDF)
  – Examples: PIO (Community Climate System Model, CCSM) and GIO (Global Cloud Resolving Model, GCRM)

• MPI-IO
  – MPI standard, uniform API to all file systems
  – Lower-level parallel I/O optimizations: coordinate processes to rearrange I/O requests (collective I/O, data sieving, caching, lock alignment)

• Parallel file systems
  – Lustre, PVFS2, GPFS, PanFS
  – File striping, caching, consistency controls, scalable metadata operations, data reliability, fault tolerance
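
To make the MPI-IO layer of the stack concrete, here is a minimal sketch of a collective write through MPI-IO, assuming a shared file and one contiguous block per process; the file name, block size, and offset scheme are illustrative, not taken from the slides.

    /* Minimal sketch of a collective MPI-IO write: each process writes one
     * contiguous block of a shared file at a rank-based offset (hypothetical
     * file name and sizes). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank;
        double buf[1024];                 /* local data block */
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 1024; i++) buf[i] = rank;   /* sample data */

        /* rank-based offset so the blocks are laid out back to back */
        offset = (MPI_Offset)rank * 1024 * sizeof(double);

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* collective call: the MPI-IO layer can coordinate processes and
         * rearrange requests (two-phase I/O, data sieving, lock alignment) */
        MPI_File_write_at_all(fh, offset, buf, 1024, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }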

Page 3

I/O resources at NERSC

• NERSC Global Filesystem (NGF), an IBM GPFS
  – HPC: Franklin (Cray XT4), Hopper (Cray XT5), Hopper II (Cray XE6), Carver (IBM)
  – Analytics clusters: PDSF (Linux), Euclid (Sun)
  – Others: Tesla/Tuning, Dirac (GPU cluster), Magellan (Cloud)

• Parallel file systems
  – Lustre, GPFS

• Archival storage (HPSS)

• Consulting team
  – Very helpful for answering I/O-related questions, including hardware configuration, software availability, run-time environment setup, system performance numbers, communication with Cray, etc.

Page 4

I/O software at NERSC

• High-level I/O libraries
  – HDF5, netCDF4, Parallel netCDF

• I/O tracing libraries
  – IPMIO (profiling POSIX calls)

• Parallel I/O middleware
  – MPI-IO (Cray's implementation)
  – ROMIO with NWU's optimization (file domain alignment)

• Lustre parallel file systems
  – User-customizable striping configuration, an important feature for I/O developers
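
As a concrete illustration of the user-customizable striping configuration, the sketch below passes Lustre striping hints to MPI-IO at file creation; the hint keys "striping_factor" and "striping_unit" are recognized by ROMIO and Cray MPI-IO, while the file name, stripe count, and stripe size are assumptions chosen for the example.

    /* Sketch: passing Lustre striping hints through MPI-IO.  The hints only
     * take effect when the file is created. */
    #include <mpi.h>

    void open_with_striping(MPI_Comm comm, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "16");      /* number of OSTs  */
        MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MiB stripes   */

        MPI_File_open(comm, "checkpoint.nc",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
        MPI_Info_free(&info);
    }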

Page 5

Case studies

• MPI-IO file domain alignments
  – Reorganize I/O requests to match the Lustre locking protocol
  – Significantly improves performance on Franklin

• I/O delegation
  – An additional set of compute nodes enables caching, prefetching, and aggregation
  – I/O requests are forwarded from application processes to delegates
  – Boosts independent I/O to be competitive with collective I/O

• Parallel netCDF non-blocking I/O
  – Aggregates many small requests for better bandwidth

• Data-model based I/O library
  – Next generation high-level I/O library
  – Supports data models (seven dwarfs) with new file formats and data layouts

Page 6

MPI-IO file domain alignments

• Lustre
  – The file system uses locks to keep data consistent

• File domain alignment in collective I/O
  – Minimize the number of clients accessing each file server

• Two implementations
  – NWU (single-stage), published at SC08*
  – Cray MPI-IO (multi-stage), available since June 2009

[Figure: two file domain partitioning schemes for collective I/O. Six processes (P0-P5) write an aggregate access region of a file striped across three I/O servers (S0-S2); aligning file domain boundaries with the file stripe size reduces the number of processes whose requests reach each I/O server.]

*W. Liao and A. Choudhary. Dynamically Adapting File Domain Partitioning Methods for Collective I/O Based on Underlying Parallel File System Locking Protocols. SC 2008.
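
The following is a minimal sketch of the stripe-alignment idea only: it carves an aggregate access region [start, end) into per-aggregator file domains whose boundaries fall on stripe boundaries. The function name and the even distribution of stripes are illustrative; the published NWU and Cray schemes are considerably more sophisticated.

    /* Sketch: stripe-aligned file domains for collective I/O aggregators. */
    typedef long long offset_t;

    /* compute the file domain of aggregator `rank` out of `naggr` */
    void stripe_aligned_domain(offset_t start, offset_t end, offset_t stripe_size,
                               int naggr, int rank,
                               offset_t *fd_start, offset_t *fd_end)
    {
        /* stripes covering the aggregate access region [start, end) */
        offset_t first_stripe = start / stripe_size;
        offset_t last_stripe  = (end - 1) / stripe_size;
        offset_t nstripes     = last_stripe - first_stripe + 1;

        /* distribute whole stripes (not bytes) evenly among aggregators */
        offset_t per_aggr = nstripes / naggr;
        offset_t remain   = nstripes % naggr;
        offset_t my_first = first_stripe + rank * per_aggr
                          + (rank < remain ? rank : remain);
        offset_t my_count = per_aggr + (rank < remain ? 1 : 0);

        /* clip the stripe-aligned domain to the access region */
        *fd_start = my_first * stripe_size;
        *fd_end   = (my_first + my_count) * stripe_size;
        if (*fd_start < start) *fd_start = start;
        if (*fd_end   > end)   *fd_end   = end;
        if (my_count == 0)     *fd_start = *fd_end = 0;  /* nothing assigned */
    }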

Page 7

Improvement from file domain alignment

• Franklin @ NERSC
  – Compares Cray's and NWU's MPI-IO implementations
  – Measured peak for write is 16 GB/sec

• S3D
  – Combustion application from Sandia National Laboratories
  – Four global arrays: two 3D and two 4D
  – Each process's subarray size is 50x50x50

• Flash
  – Astrophysics application from U. of Chicago
  – I/O method: HDF5
  – Each process writes 80~82 arrays of size 32x32x32

!"

#"

$"

%"

&"

'!"

'#"

(#" %$" '#&" #)%" )'#" '!#$" #!$&" $!*%" &'*#"

!"#$%&'(

)*+#*$,&#)&-'./%0&

1234%"&56&05"%/&

789&

+,-."/01213" 456"/01213"

!"

#"

$"

%"

&"

'!"

'#"

'$"

(#" %$" '#&" #)%" )'#" '!#$" #!$&" $!*%" &'*#"

!"#$%&'(

)*+#*$,&#)&-./0%1&

234'%"&56&15"%0&

78(0,&

+,-."/01213" 456"/01213"

Page 8

I/O delegation

• Runs inside MPI-IO

• Runs on a small set of additional MPI processes

• All I/O delegates collaborate for better performance

• Related work
  – I/O forwarding developed by Rob Ross's team at ANL

[Diagram: MPI application processes (A) on compute nodes forward their I/O requests to a small set of I/O delegate nodes (D), which perform the accesses to the I/O servers (S).]
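
A minimal sketch of the set-aside-process idea, using an MPI communicator split to separate application ranks from I/O delegate ranks (the 1-in-8 ratio comes from the next slide). The actual delegation system runs inside the MPI-IO layer, so the rank assignment and the forwarding/caching logic hinted at in the comments are assumptions for illustration.

    /* Sketch: reserve 1 of every 8 ranks as an I/O delegate and give the two
     * roles separate communicators. */
    #include <mpi.h>

    #define DELEGATE_RATIO 8   /* 1/8 of the processes act as I/O delegates */

    int main(int argc, char **argv)
    {
        int rank, is_delegate;
        MPI_Comm role_comm;    /* communicator of my role (app or delegate) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        is_delegate = (rank % DELEGATE_RATIO == 0);

        /* color 0 = application processes, color 1 = I/O delegates */
        MPI_Comm_split(MPI_COMM_WORLD, is_delegate ? 1 : 0, rank, &role_comm);

        if (is_delegate) {
            /* delegate: receive forwarded requests, cache/aggregate, write */
        } else {
            /* application: compute, then forward I/O requests to a delegate */
        }

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }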

Page 9

Improvement from I/O delegation

• Franklin
  – Compared with Cray's collective I/O

• I/O delegation
  – 1/8 of the application processes serve as I/O delegates
  – Enhances independent I/O performance to be similar to collective I/O
  – Using independent I/O eases I/O programming
  – The above results were presented in a paper accepted by IEEE TPDS*

[Charts: S3D and Flash results]

*A. Nisar, W. Liao, and A. Choudhary. Scaling Parallel I/O Performance through I/O Delegate and Caching System. SC 2008.

Page 10

PnetCDF

• A parallel I/O library for accessing files in CDF format
  – Version 1.0 released in 2005, now at version 1.2

• Optimizations
  – Built on top of MPI-IO
  – File header and dataset alignment
  – Non-blocking I/O enables aggregation of multiple requests

• Developers
  – NU: Jianwei Li (graduated in 2006), Kui Gao (postdoc), Wei-keng Liao, Alok Choudhary
  – ANL: Rob Ross, Rob Latham, Rajeev Thakur, Bill Gropp

• Recent application collaborators
  – GIO (PNNL), FLASH (U. Chicago), CCSM (NCAR)
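
A minimal sketch of the PnetCDF non-blocking interface that makes this aggregation possible: several small iput requests are posted and then flushed with one collective wait. The variable IDs, buffers, and hyperslab arguments are placeholders.

    /* Sketch: post non-blocking PnetCDF writes and flush them together, so
     * the library can aggregate them into larger MPI-IO requests. */
    #include <mpi.h>
    #include <pnetcdf.h>

    void write_vars(int ncid, int varid1, int varid2,
                    const float *a, const float *b,
                    const MPI_Offset *start, const MPI_Offset *count)
    {
        int reqs[2], stats[2];

        /* post non-blocking writes; no file access happens yet */
        ncmpi_iput_vara_float(ncid, varid1, start, count, a, &reqs[0]);
        ncmpi_iput_vara_float(ncid, varid2, start, count, b, &reqs[1]);

        /* collective wait: pending requests are aggregated and flushed */
        ncmpi_wait_all(ncid, 2, reqs, stats);
    }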

Page 11

GCRM I/O

• GIO (Geodesic parallel I/O API), developed at PNNL
  – An I/O library developed by Karen Schuchardt at PNNL
  – Used by the Global Cloud Resolving Model (GCRM), developed at Colorado State University to simulate the global climate
  – Provides user-configurable I/O methods: PnetCDF, HDF5, netCDF4

Pictures courtesy of Karen Schuchardt

Page 12

GCRM I/O performance

• I/O pattern
  – Each process writes many noncontiguous data blocks for each variable

• GIO strategies
  – Direct and interleaved messaging methods to exchange I/O requests for better performance

• PnetCDF non-blocking I/O
  – Delays requests so that small requests can be aggregated into large ones
  – Simplifies the programming task
  – Results were presented at the Workshop on High-Resolution Climate Modeling 2010*

[Chart: GCRM run on Franklin @ NERSC; write bandwidth in GB/sec vs. number of application processes (640, 1280, 2560), comparing blocking and non-blocking I/O with 160, 320, and 640 I/O processes.]

Resolution level | Number of cells | Grid-point spacing
9                | 2.6 million     | 15.6 km
10               | 10.5 million    | 7.8 km
11               | 41.9 million    | 3.9 km

*B. Palmer, K. Schuchardt, A. Koontz, R. Jacob, R. Latham, and W. Liao. IO for High Resolution Climate Models. Workshop on High-Resolution Climate Modeling, 2010.

Page 13

Chombo I/O

[Figure: (a) combined AMR grid, (b) mesh refinement, (c) hierarchical grid in box layout; a block-structured hierarchical grid with three levels of refinement (coarsest, intermediate, finest) distributed among processes p0-p3.]

• PDE tool from LBNL
  – Supports block-structured AMR grids

• I/O pattern
  – Array variables are partitioned among a subset of processes
  – Calls the MPI independent I/O API, as collective I/O is not feasible

• PnetCDF non-blocking I/O
  – Aggregates requests to multiple variables
  – One collective I/O call carries out the aggregated request

#"

$"

%"

&"

'!"

'#"

%$" '#&" #(%" ('#" '!#$" #!$&" $')%" &')#"

*+,-./0"+1+231456+7"

*+,-./0"231456+7"

8/0("231456+7"

9:;2,<"1=">??364>@1+"?<14,AA,A"

B<6-,"2>+CD6C-E"6+"FGHA,4"

Chombo  run  on  Franklin  @  NERSC  

13  

Page 14

Data-model based I/O library (X-Stack)
(PIs: Alok Choudhary, Wei-keng Liao, Northwestern; Rob Ross, Tim Tautges, ANL; Nagiza Samatova, NCSU; Quincey Koziol, HDF Group)

• DAMSEL: next generation high-level I/O library

• Supports various data models
  – Describes unstructured data relationships: trees, graph-based, space-filling curves, etc.
  – Scalable I/O for irregularly distributed data objects
  – More sophisticated data query API

• Virtual filing
  – A file container of multiple files appears as a single file
  – Balances concurrent access (to reduce contention) against the number of files created (to ease file manageability)

Space-filling curves are used in climate codes when partitioning the grid to improve scalability. Image courtesy of John Dennis (NCAR).

Page 15

Computational motifs

Page 16

Future

• Challenges
  – Reads in data analysis, since in most cases only subsets of the data are read

• Hardware accelerators
  – GPUs for on-line data compression and analysis

• Faster storage devices
  – SSDs as a read cache at compute nodes or I/O servers

Page 17

Active storage

• Offloads I/O-intensive computation to the file servers

• If the servers are equipped with GPUs, the operations can run even faster
  – TS: traditional storage, task runs on clients
  – AS: active storage, task runs on the server's CPU
  – AS + GPU: task runs on the server's GPU

[Chart: KMEANS]
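
As a rough, hypothetical illustration of the kind of data-parallel kernel that active storage could offload, the sketch below shows the k-means assignment step over a chunk of points held by a storage server; only the small assignment vector would need to travel back to the client. This is not the code behind the KMEANS result on the slide.

    /* Sketch: k-means assignment step run where the data lives (server CPU
     * or GPU) instead of shipping raw points to the clients. */
    #include <float.h>

    /* points: n x dim row-major; centroids: k x dim; assign[i] = nearest */
    void kmeans_assign(const float *points, int n, int dim,
                       const float *centroids, int k, int *assign)
    {
        for (int i = 0; i < n; i++) {
            float best = FLT_MAX;
            int best_c = 0;
            for (int c = 0; c < k; c++) {
                float d = 0.0f;
                for (int j = 0; j < dim; j++) {
                    float diff = points[i * dim + j] - centroids[c * dim + j];
                    d += diff * diff;      /* squared Euclidean distance */
                }
                if (d < best) { best = d; best_c = c; }
            }
            assign[i] = best_c;   /* only this small result leaves the server */
        }
    }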

Page 18

GPUs to accelerate data analytics

• Our I/O delegation work demonstrates that set-aside processes improve I/O performance

• Delegate processes can also be used to off-load data-intensive computation

• An HPC system with a subset of compute nodes equipped with GPUs and SSDs provides a rich experimental platform

Page 19

Publications

• Alok Choudhary, Wei-keng Liao, Kui Gao, Arifa Nisar, Robert Ross, Rajeev Thakur, and Robert Latham. Scalable I/O and Analytics. Journal of Physics: Conference Series, Volume 180, Number 012048 (10 pp), August 2009.

• Kui Gao, Wei-keng Liao, Arifa Nisar, Alok Choudhary, Robert Ross, and Robert Latham. Using Subfiling to Improve Programming Flexibility and Performance of Parallel Shared-file I/O. In Proceedings of the International Conference on Parallel Processing, Vienna, Austria, September 2009.

• Kui Gao, Wei-keng Liao, Alok Choudhary, Robert Ross, and Robert Latham. Combining I/O Operations for Multiple Array Variables in Parallel netCDF. In Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, held in conjunction with the IEEE Cluster Conference, New Orleans, Louisiana, September 2009.

• B. Palmer, K. Schuchardt, A. Koontz, R. Jacob, R. Latham, and W. Liao. IO for High Resolution Climate Models. Workshop on High-Resolution Climate Modeling, 2010.

• Arifa Nisar, Wei-keng Liao, and Alok Choudhary. Delegation-based I/O Software Architecture for High Performance Computing Systems. Accepted by IEEE TPDS, 2010.

Page 20

I/O resources from NERSC

• NERSC Global Filesystem (NGF), an IBM GPFS
  – HPC: Franklin (Cray XT4), Hopper (Cray XT5), Hopper II (Cray XE6), Carver (IBM)
  – Analytics clusters: PDSF (Linux), Euclid (Sun)
  – Others: Tesla/Tuning, Dirac (GPU cluster), Magellan (Cloud)

• Parallel file systems
  – Lustre, GPFS

• Archival storage (HPSS)

• Consulting team
  – Very helpful for answering I/O-related questions, including hardware configuration, software availability, run-time environment setup, system performance numbers, communication with Cray, etc.

