Page 1: Exascale Architecture Trends (source: MCSpress3.mcs.anl.gov/atpesc/files/2015/08/Beckman-Dinner...)

Exascale Architecture Trends

Pete Beckman, Argonne National Laboratory / Northwestern University

Page 2:

HPC has been pretty successful…

Tianhe-2, Sequoia, K Computer

Page 3:

Old Wisdom: Moore’s Law = free exponential speedups!


Page 4:

Reality: Computing improvements have slowed dramatically over the past decade

*“No Moore?”, Economist, Nov 2013. Source: Linley Group

The number of transistors you can buy for a fixed number of dollars in leading technology is no longer increasing!

Single-thread performance improvement is slow. (SPECint)

“Intel has done a little better over this period, increasing at 21% per year.”

Courtesy: Andrew Chien

Herbert Stein’s Law: “If something cannot go on forever, it will stop.”


Page 5:


Page 6:

Old Wisdom: Efficient Algorithms minimize operations

Classic Analysis of Algorithms: Ops = Time
Make the algorithm quicker: minimize flops, compares
Ops: Best, Worst, Average, Space
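Under that model, time is just a function of operation count, so an algorithm improves by doing fewer operations. A toy illustration of the counting mindset (a hypothetical example, not from the talk): comparing the number of probes made by linear versus binary search.

```python
def linear_search_compares(data, target):
    """Linear scan; returns (found, number of element probes)."""
    compares = 0
    for x in data:
        compares += 1
        if x == target:
            return True, compares
    return False, compares

def binary_search_compares(data, target):
    """Binary search on sorted data; returns (found, number of probes)."""
    compares = 0
    lo, hi = 0, len(data)
    while lo < hi:
        mid = (lo + hi) // 2
        compares += 1
        if data[mid] == target:
            return True, compares
        if data[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return False, compares

data = list(range(1_000_000))
_, lin = linear_search_compares(data, 999_999)
_, log = binary_search_compares(data, 999_999)
print(lin, log)   # roughly n vs. log2(n) probes
```

The classic model scores binary search purely by its smaller probe count; it says nothing about where the probed data lives.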

1996


Page 7:

Reality: Efficient = optimize data movement (and power)

[Figure: picojoules per 64-bit operation, log scale 1 to 10,000 pJ; bars for DP FLOP, register, 1 mm on-chip, 5 mm on-chip, 15 mm on-chip, off-chip/DRAM, local interconnect, cross-system; 2008 (45 nm) vs. 2018 (11 nm)]

Comparing Data Movement to Operations

Pipelining, load/store, GPGPU… Courtesy: Peter Kogge

Courtesy:  John  Shalf  
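To see why data movement dominates, it helps to put rough numbers on a kernel. The picojoule figures below are illustrative round numbers in the spirit of the chart (not exact values from it), and the kernel is hypothetical:

```python
# Rough energy-per-event figures (picojoules); illustrative only.
ENERGY_PJ = {
    "dp_flop": 10,        # one double-precision floating-point op
    "register": 1,        # register access
    "on_chip_5mm": 30,    # moving 64 bits 5 mm across the die
    "dram": 2000,         # moving 64 bits off-chip to DRAM
}

def kernel_energy_pj(flops, dram_words):
    """Energy of a kernel doing `flops` ops on `dram_words` operands from DRAM."""
    return flops * ENERGY_PJ["dp_flop"] + dram_words * ENERGY_PJ["dram"]

# A streaming kernel (one flop per word fetched from DRAM) is dominated
# by data movement, not arithmetic:
e = kernel_energy_pj(flops=1_000_000, dram_words=1_000_000)
movement_share = (1_000_000 * ENERGY_PJ["dram"]) / e
print(f"{movement_share:.1%} of energy spent moving data")
```

With these numbers, arithmetic is a rounding error; only reuse (more flops per word moved) changes the balance.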


Page 8:

Old Wisdom: Parallel Algorithms: Equal Work = Equal Time (computers run at predictable speeds)

SPMD Code:
Divide data into equal-sized chunks across p processors
For all timesteps {
    exchange data with neighbors
    compute on local data
    barrier
}
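A minimal sketch of that SPMD pattern, with the processors and halo exchange simulated in plain Python/NumPy over a 1D domain (a toy model, not MPI code from the talk):

```python
import numpy as np

def spmd_step(chunks):
    """One bulk-synchronous timestep over equal-sized chunks of a 1D domain.

    Models the classic SPMD loop body: exchange boundary (halo) values
    with neighbors, then compute locally; the implicit barrier is that we
    finish every chunk before the next step begins.
    """
    # 1) Exchange: each chunk learns its neighbors' boundary values.
    halos = []
    for i, c in enumerate(chunks):
        left = chunks[i - 1][-1] if i > 0 else c[0]
        right = chunks[i + 1][0] if i < len(chunks) - 1 else c[-1]
        halos.append((left, right))
    # 2) Compute: 3-point average on local data, using halos at the edges.
    new_chunks = []
    for c, (left, right) in zip(chunks, halos):
        padded = np.concatenate([[left], c, [right]])
        new_chunks.append((padded[:-2] + padded[1:-1] + padded[2:]) / 3.0)
    return new_chunks

domain = np.arange(8, dtype=float)
chunks = np.array_split(domain, 4)     # equal work per "processor"
chunks = spmd_step(chunks)
result = np.concatenate(chunks)
```

Equal-sized chunks encode the "equal work = equal time" assumption directly: the step is only as fast as its slowest chunk.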


Page 9:

We live with dynamic now…


[Figure annotations: 115 W limit, 110 W limit]

Page 10:

Reality: Performance is Highly Variable

Memory Hierarchy Depth (1 – 150 – ?)

[Figure: base vs. turbo clock behavior, 2009 vs. 2015, scale 0 to 3. Courtesy: McCalpin]

+ new non-volatile memory (3,000 cycles)
+ old non-volatile memory, Flash (150,000 cycles)

Turbo Boost (1.2 – 2.5 – ?)

Courtesy: Andrew Chien


Page 11:

The New Exascale Reality

§ Computing rapidly gets faster and cheaper for free
– Rapid exponential improvement is over; slow improvement will continue for a while… Parallelism explodes, SQUEEEEZE!

§ Efficient programs minimize operations
– More operations can be better; optimize for locality, data movement, power

§ Computers run at fixed, predictable speed
– Increasingly dynamic and flexible; complication and advantage


Page 12:

What Prevents Scalability? (in the large and in the small)

§  Insufficient  parallelism  –  As  the  problem  scales,  more  parallelism  must  be  found  

§  Insufficient  latency  hiding  –  As  the  problem  scales,  more  latency  must  be  hidden  

§  Insufficient  resources  (Memory,  BW,  Flops)  –  As  the  problem  scales,  so  must  the  resources  needed  
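The first bullet has a classical quantitative form, Amdahl's law: if a fraction s of the work is serial, speedup on p processors is 1/(s + (1-s)/p). A quick check (a standard formula, not from the slides):

```python
def amdahl_speedup(p, serial_fraction):
    """Speedup on p processors when a fixed fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even 1% serial work caps speedup near 100x, however large the machine:
for p in (10, 100, 10_000, 1_000_000):
    print(p, round(amdahl_speedup(p, 0.01), 1))
```

This is why "more parallelism must be found" as the problem scales: any fixed serial remainder eventually dominates.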


Page 13:

As we scale the machine, the system becomes more dynamic.
As we squeeze power, the system becomes more dynamic.
As we address resilience, the system becomes more dynamic.
As we share networks, the system becomes more dynamic.

What Prevents Scalability? (in the large and in the small)

§  Insufficient  parallelism  –  As  the  problem  scales,  more  parallelism  must  be  found  

§  Insufficient  latency  hiding  –  As  the  problem  scales,  more  latency  must  be  hidden  

§  Insufficient  resources  (Memory,  BW,  Flops)  –  As  the  problem  scales,  so  must  the  resources  needed  


Page 14:

Our Hardware is Dynamic, Adaptive Today! (the future is even more dynamic)


• Bulk Synchronous is our scaling problem
• ≠ MPI (a library that moves data with put/get or send/recv)
• We must focus on dynamic behavior

• “OS noise” and “jitter” are a legacy distraction
• OS & runtime must be VERY active…

• Load balancing is necessary, but not sufficient…

• How do we design software in this new era?
• How do we build latency-tolerant algorithms?
• Can we create tools that measure, learn, predict, and then improve performance?

Page 15:

Yet We Pretend Our World Is Not Dynamic

§ Trinity/NERSC-8: “The system shall provide correct and consistent runtimes. An application’s runtime (i.e. wall clock time) shall not change by more than 3% from run-to-run in dedicated mode and 5% in production mode.”

 


ASCAC Top 10 Research Challenges for Exascale:
• “[…] power management […] through dynamic adjustment of system balance to fit within a fixed power budget”
• “[…] Enabling […] dynamic optimizations […] (power, performance, and reliability) will be crucial to scientific productivity.”
• “[…] Next-generation runtime systems are under development that support different mixes of several classes of dynamic adaptive functionality.”

“dynamic” is mentioned 43 times in the 86-page report

Page 16:

Lessons for the Future

§ Code should be as static as possible, but no more so
§ 1) Prepare: create flexibility via over-decomposition, clear expression of dependencies
§ 2) Take small steps to becoming more pliable… statically
– (static) mapping of resources (slow/fast; heat)
– (static) load balancing (periodic repartitioning)
– (static) dependency-graph tiling of stencils to match communication
§ 3) Find goal-oriented optimization
– Dynamic lightweight work-sharing
– Dynamic power management
– Dynamic data movement across the hierarchy

Code should not treat dynamic behavior as a performance error (e.g. NERSC)
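Over-decomposition (step 1) plus dynamic lightweight work-sharing (step 3) can be sketched in a few lines: split the work into many more tasks than workers, and let idle workers pull from a shared queue so uneven tasks don't stall fast workers. A hypothetical illustration, not code from the talk:

```python
import queue
import threading

def run_overdecomposed(tasks, n_workers):
    """Execute tasks of uneven cost with dynamic work-sharing.

    Over-decomposition: many more tasks than workers, so the shared queue
    keeps every worker busy even when per-task costs vary.
    """
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()       # pull next unit of work
            except queue.Empty:
                return                   # no work left: worker retires
            with lock:
                done.append(t)           # "compute": record the task

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done

# 64 tasks shared dynamically among 4 workers: each completes exactly once.
completed = run_overdecomposed(list(range(64)), n_workers=4)
```

The static variants on this slide fix the task-to-worker mapping up front; the queue here is the dynamic, goal-oriented end of the same spectrum.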


Page 17:


ANL: Pete Beckman, Marc Snir, Pavan Balaji, Rinku Gupta, Kamil Iskra, Franck Cappello, Rajeev Thakur, Kazutomo Yoshii

LLNL: Maya Gokhale, Edgar Leon, Barry Rountree, Martin Schulz, Brian Van Essen
PNNL: Sriram Krishnamoorthy, Roberto Gioiosa
UC: Henry Hoffmann
UIUC: Laxmikant Kale, Eric Bohm, Ramprasad Venkataraman
UO: Allen Malony, Sameer Shende, Kevin Huck
UTK: Jack Dongarra, George Bosilca, Thomas Herault

New abstractions & implementations

Exascale Operating System

Page 18:

Argobots

§ Lightweight low-level threading/tasking framework
§ Separation of abstraction and mapping to implementation
§ Massive parallelism
– Execution Streams guarantee progress
– Work Units execute to completion
§ Clearly defined memory semantics
– Consistency domains
• Provide eventual consistency
– Software can manage consistency
– Work Units can access any consistency domain
– Support explicit memory placement and movement
§ Put/Get/Send/Recv requires a library call in the OS/R, but could be transparent at the application level
§ Exploring fault model and atomics
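Argobots itself is a C library; the following is only a toy Python model of the contract described above (execution streams guarantee progress, work units run to completion), with class and method names invented for illustration:

```python
from collections import deque

class ExecutionStream:
    """Toy model of an Argobots-style execution stream.

    The stream guarantees progress by repeatedly popping work units from
    its pool; each work unit (a plain callable here) runs to completion,
    i.e. it is never preempted mid-execution.
    """
    def __init__(self):
        self.pool = deque()

    def push(self, work_unit):
        self.pool.append(work_unit)

    def run(self):
        results = []
        while self.pool:
            wu = self.pool.popleft()
            results.append(wu())   # run-to-completion: no preemption
        return results

es = ExecutionStream()
for i in range(4):
    es.push(lambda i=i: i * i)
print(es.run())  # [0, 1, 4, 9]
```

The real library binds execution streams to hardware threads and adds pools, schedulers, and yield points; this sketch captures only the stream/work-unit split.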

[Figure: Argobots execution model. Execution Streams run Work Units; consistency domains (CD0, CD1) span cache-coherent and non-coherent memory.]


Page 19:

Task Schedulers + Argobots
§ Task scheduling built on tasklets and user-level threads in Argobots
§ Focus on two classes of task graphs
– Fork-join computations
– Compact DAG representations
§ Exploit the scheduling characteristics of Argobots
– Control over mapping threads to cores
– Control over scheduling
– Split-phase communication and task scheduling
§ Initial implementation
– Argobots-optimized Cilk scheduler
– Parallel Task Graph Engine (PTGE)
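The compact-DAG side can be sketched generically: keep an unmet-dependency counter per task and release a task when it reaches zero. This is a standard topological scheduler, not the PTGE code:

```python
from collections import deque

def schedule_dag(deps):
    """Topologically execute a task DAG given {task: [prerequisites]}.

    Tracks an unmet-dependency count per task; a task becomes ready when
    its count hits zero, mirroring how compact-DAG schedulers release work.
    """
    remaining = {t: len(p) for t, p in deps.items()}
    children = {t: [] for t in deps}
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)
    if len(order) != len(deps):
        raise ValueError("cycle in task graph")
    return order

# Fork-join shape: A fans out to B and C, which join at D.
order = schedule_dag({"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]})
print(order)
```

In a real runtime the ready queue feeds worker threads instead of a sequential loop, but the dependency-count bookkeeping is the same.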

Threading-aware Task Schedulers

[Figure: example task DAG with PO, TR, SY, and GE nodes]


Page 20:


Our Future is Memory Hierarchy (adding dynamic behavior)

Cache → In-Package → Main RAM → XPoint → NAND NVRAM → Disk

Page 21:

Conclusions: The Times They are A-Changin’

§ Embrace DYNAMIC!
– Work ≠ Time
§ Optimize algorithms for data movement
§ Imagine multiple memory allocators
– Manual data movement
§ Learn to love runtime systems
§ Explore adaptive, learning, predictive software stacks that take humans out of the loop…
– Sorry humans, you are too slow.
– Reject human-tuning papers…


Page 22:

Questions?


