
2 V3.pdf · 5 Power&7&Chip& POWER7: IBM’s Next Generation, Balanced POWER Server Chip 4 POWER7...

Date post: 07-Sep-2020
Transcript
Page 1: 2 V3.pdf · 5 Power&7&Chip& POWER7: IBM’s Next Generation, Balanced POWER Server Chip 4 POWER7 Processor Chip 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM 1.2B transistors
Page 2

2        www.parallel.illinois.edu  

Parallel @ Illinois

Illiac IV

UPCRC

Cloud Computing Testbed

OpenSparc Center of Excellence

CUDA Center of Excellence

Extreme Scale Computing

Page 3

Blue Waters

• Sustained petaflop/s on complex applications (QCD, turbulence, molecular dynamics, …)
• > 200,000 cores
• > 800 TB memory
• > 10 PB disk
• > 500 PB tape
• 100-400 Gbps external BW
• IBM Power 7 technology

Page 4

POWER7: IBM’s Next Generation, Balanced POWER Server Chip

POWER7: Core Execution Units

• 2 Fixed point units
• 2 Load store units
• 4 Double precision floating point
• 1 Branch
• 1 Condition register
• 1 Vector unit
• 1 Decimal floating point unit
• 6 wide dispatch
• Recovery Function Distributed
• 1, 2, 4 Way SMT Support
• Out of Order Execution
• 32KB I-Cache
• 32KB D-Cache
• 256KB L2, tightly coupled to core

[Figure: core floorplan with blocks labeled IFU, CRU/BRU, ISU, DFU, FXU, VSX FPU, LSU, and 256KB L2]

(Hot Chips IBM presentation)

Page 5

Power 7 Chip

POWER7: IBM’s Next Generation, Balanced POWER Server Chip

POWER7 Processor Chip

• 567mm²
• Technology: 45nm lithography, Cu, SOI, eDRAM
• 1.2B transistors
  – Equivalent function of 2.7B
  – eDRAM efficiency
• Eight processor cores
  – 12 execution units per core
  – 4 Way SMT per core
  – 32 Threads per chip
  – 256KB L2 per core
• 32MB on chip eDRAM shared L3
• Dual DDR3 Memory Controllers
  – 100GB/s Memory bandwidth per chip sustained
• Scalability up to 32 Sockets
  – 360GB/s SMP bandwidth/chip
  – 20,000 coherent operations in flight
• Advanced pre-fetching, Data and Instruction
• Binary Compatibility with POWER6

* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

Page 6

Possible Power 7 Package

POWER7: IBM’s Next Generation, Balanced POWER Server Chip

POWER7 Design Principles: Flexibility and Adaptability

Cores:
• 8, 6, and 4-core offerings with up to 32MB of L3 Cache
• Dynamically turn cores on and off, reallocating energy
• Dynamically vary individual core frequencies, reallocating energy
• Dynamically enable and disable up to 4 threads per core

Memory Subsystem:
• Full 8 channel or reduced 4 channel configurations

System Topologies:
• Standard, half-width, and double-width SMP busses supported

Multiple System Packages:
• 2/4s Blades and Racks, Single Chip Organic: 1 Memory Controller; 3 4B local links
• High-End and Mid-Range, Single Chip Glass Ceramic: 2 Memory Controllers; 3 8B local links; 2 8B Remote links
• Compute Intensive, Quad-chip MCM: 8 Memory Controllers; 3 16B local links (on MCM)

* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

Page 7

Performance grows 1,000-fold every 11 years

(Kogge)

Can we achieve the next jump?
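
To put the 1,000-fold-per-11-years figure in yearly terms, here is a minimal sketch that assumes growth compounds at a constant rate. The function names are illustrative and do not come from the slides.

```python
import math

def annual_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)

def doubling_time(total_factor: float, years: float) -> float:
    """Years needed to double performance at that constant rate."""
    return years * math.log(2.0) / math.log(total_factor)

rate = annual_factor(1000.0, 11.0)
double = doubling_time(1000.0, 11.0)
print(f"yearly growth factor: {rate:.3f}")
print(f"doubling time: {double:.2f} years")
```

Under this assumption the trend corresponds to roughly 1.87x per year, or a doubling about every 1.1 years.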

Page 8

Moore’s Law Continues

Moore’s Law is Alive and Well

[Figure: log-scale plot (1.E-01 to 1.E+07), 1970-2010, of transistors (in thousands). Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]

Will continue in coming decade

(Olukotun)

Page 9

Clock Frequency Stagnant

But Clock Frequency Scaling Replaced by Scaling Cores / Chip

[Figure: log-scale plot (1.E-01 to 1.E+07), 1970-2010, of transistors (in thousands), frequency (MHz), and cores. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]

15 Years of exponential growth ~2x year has ended

(Olukotun)

Page 10

Performance Has Also Slowed, Along with Power

[Figure: log-scale plot (1.E-01 to 1.E+07), 1970-2010, of transistors (in thousands), frequency (MHz), power (W), perf, and cores. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]

Power is the root cause of all this

Future increases in performance will come only from increases in number of concurrent threads

End of Single-Thread Era
• Little/no benefit from increased transistor count
• Decreasing benefit from frequency increases
• Power limits reached

Page 11

Number of Cores Increases Rapidly

This has Also Impacted HPC System Concurrency

Exponential wave of increasing concurrency for foreseeable future! 1M cores sooner than you think!

[Figure: sum of the # of cores in top 15 systems (from top500.org)]

A million cores in a couple of years; a billion threads in a decade?

Page 12

Power Budget

TFLOP Machine today:
• Compute: 200pJ per FLOP, 200W
• Memory: 0.1B/FLOP @ 1.5nJ per Byte, 150W
• Com: 100pJ com per FLOP, 100W
• Disk: 10TB disk @ 1TB/disk @ 10W, 100W
• Decode and control, translations, power supply losses: 4450W
• Total: 5KW

KW Tera, MW Peta, GW Exa

GW for Exaflop/s with today’s technology; 100 MW in a decade?

(Borkar)

Page 13

Aggressive Power Scaling

The Power & Energy Challenge

[Figure: power breakdown by Compute, Memory, Com, and Disk for “TFLOP Machine today” versus “TFLOP Machine then, With Exa Technology”; the numeric labels are not recoverable from the extraction]

• ~1B threads
• Heterogeneous architecture
• Mostly nearest-neighbor communication
• Long “cache lines”
• High error-rates

(Borkar)

20 MW for Exaflop/s with aggressive, likely custom design, and a harder-to-program machine

Software to the rescue!
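
A power budget for an exaflop/s machine translates directly into an energy budget per floating-point operation. The sketch below does that conversion for the 100 MW and 20 MW figures quoted in the slides, plus a 1 GW entry that is only an assumed stand-in for "GW"-scale power with today's technology. The function name is illustrative.

```python
EXAFLOP_PER_SECOND = 1e18  # floating-point operations per second

def picojoules_per_flop(power_watts: float,
                        flops_per_second: float = EXAFLOP_PER_SECOND) -> float:
    """Total system energy spent per floating-point operation, in picojoules."""
    joules_per_flop = power_watts / flops_per_second
    return joules_per_flop * 1e12

budgets = {
    "1 GW (assumed order of magnitude)": 1e9,
    "100 MW": 100e6,
    "20 MW": 20e6,
}
for label, watts in budgets.items():
    print(f"{label}: {picojoules_per_flop(watts):.0f} pJ per flop")
```

At 20 MW the whole system (compute, memory, communication, storage, and overheads) has about 20 pJ to spend per operation.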

Page 14

Main Issues

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 15

Managing 1B threads

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 16

Scaling Applications

Weak scaling: use more powerful machine to solve larger problem – increase application size and keep running time constant; e.g., refine grid

• Larger problem may not be of interest
  – Identify problems that require petascale performance
• May want to scale time, not space (e.g., molecular dynamics)
  – Study parallelism in time domain
• Cannot scale space without scaling time (iterative methods): granularity decreases and communication increases

Page 17

Scaling Iterative Methods

• Assume that number of cores (and compute power) increases by factor of k
• Space and time scales are refined by factor of k^(1/4)
• Mesh size increases by factor of k^(3/4)
• Per core cell volume decreases by factor of k^(1/4)
• Per core cell area decreases by a factor of k^((1/4)×(2/3)) = k^(1/6)
• Area to volume ratio (communication to computation ratio) increases by factor of k^(1/4) / k^(1/6) = k^(1/12)
• Per core computation is finer grained and needs relatively more communication
• (Per chip computation is coarser grained and needs relatively less communication if most increase in # cores is per chip)
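
The exponents in this argument can be checked mechanically. The sketch below tracks each quantity as an exact power of k, assuming a 3D mesh where each spatial dimension and the time step are refined by the same factor. The dictionary keys are illustrative names, not terms from the slides.

```python
from fractions import Fraction

def scaling_exponents() -> dict:
    """Exponent e such that each quantity scales as k**e when cores scale by k."""
    refine = Fraction(1, 4)                # each of 3 space dims and time: k^(1/4)
    mesh = 3 * refine                      # total cells: k^(3/4)
    total_work = mesh + refine             # cells x time steps: k^1
    cells_per_core = mesh - 1              # k^(3/4) / k = k^(-1/4)
    surface_per_core = cells_per_core * Fraction(2, 3)   # area ~ volume^(2/3): k^(-1/6)
    area_to_volume = surface_per_core - cells_per_core   # k^(1/12)
    return {
        "mesh": mesh,
        "total_work": total_work,
        "cells_per_core": cells_per_core,
        "surface_per_core": surface_per_core,
        "area_to_volume": area_to_volume,
    }

exps = scaling_exponents()
for name, e in exps.items():
    print(f"{name}: k^({e})")

k = 1000
print(f"k = {k}: communication/computation ratio grows by "
      f"{k ** float(exps['area_to_volume']):.2f}x")
```

Total work scales as k, matching the k-fold increase in cores, while the per-core communication-to-computation ratio grows only as k^(1/12): slowly, but in the wrong direction.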

Page 18

Debugging and Tuning: Observing 1B Threads

• Scalable infrastructure to control and instrument 1B threads
• On-the-fly sensor data stream mining to identify “anomalies”
• Need the ability to express “normality” (global correctness and performance assertions)

Page 19

Locality

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 20

It’s the Memory, Stupid

• CPU performance is determined, within 10%-20%, by trace of memory accesses [Snavely]
☛ Algorithm design should focus on data accesses, not operations
  – Temporal locality: cluster accesses in time
  – Spatial locality: match data storage to access order (not vice-versa); use partially-constrained iterators
  – Processor locality: cluster accesses in processor space
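
As a rough illustration of the spatial-locality point, the sketch below counts how often consecutive accesses to an n × n row-major array fall in different cache lines, for a traversal that matches the storage order and one that does not. The 8-element cache line and both function names are assumptions made for the example; it models addresses only, not a real cache.

```python
def row_major_offsets(n: int, order: str) -> list:
    """Flat offsets into an n x n row-major array, visited row-wise or column-wise."""
    if order == "rows":
        return [i * n + j for i in range(n) for j in range(n)]
    return [i * n + j for j in range(n) for i in range(n)]

def line_switches(offsets: list, line_elems: int = 8) -> int:
    """Count consecutive accesses that land in different cache lines."""
    lines = [off // line_elems for off in offsets]
    return sum(1 for a, b in zip(lines, lines[1:]) if a != b)

n = 16
matched = line_switches(row_major_offsets(n, "rows"))     # access order = storage order
mismatched = line_switches(row_major_offsets(n, "cols"))  # access order fights storage
print(f"row-wise traversal:    {matched} cache-line switches")
print(f"column-wise traversal: {mismatched} cache-line switches")
```

With these parameters the matched traversal changes cache line 31 times and the mismatched one 255 times, once per access.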

Page 21

Theory Problem: Communication Complexity

• Results exist in combinatorial & limited algebraic models (sorting, FFT graph, n^3 matrix product…); need similar results for numerical algorithms
• E.g., what is trade-off between communication and convergence rate in domain decomposition methods?

Page 22

Heterogeneity

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 23

Hybrid Communication

• Multiple levels of caches and of cache sharing
• Different communication models intra and inter node
  – Coherent shared memory inside chip (node)
  – rDMA (put/get/update) across nodes
• Communication architecture changes every HW generation
• Need to easily adjust number of cores & replace inter-node communication with intra-node communication
• Easy to “downgrade” (use shared memory for message passing); hard to “upgrade”; hence tend to use lowest common denominator (message passing)
• No good interoperability between shared memory (e.g., OpenMP) and message passing (MPI)

Page 24

Possible Directions

• Express cache oblivious algorithms using recursive domain splitting (a la TBB)
  – Methods to (i) split domain; (ii) execute sequentially, if domain is “small”; and (iii) merge back
  – Need adaptation for iterative methods to reuse partition
  – Leads, naturally, to algorithms where communication is less frequent at tree root
  – May provide 2 method extensions:
    • Distributed memory splitting/merging
    • Shared memory splitting/merging
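
The split / execute-if-small / merge pattern can be sketched in a few lines. This is a sequential illustration only, not TBB's API: `recursive_apply`, `leaf`, `merge`, and `grain` are hypothetical names, and a real runtime would hand the two independent halves to different workers.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def recursive_apply(domain: Sequence[T],
                    leaf: Callable[[Sequence[T]], R],
                    merge: Callable[[R, R], R],
                    grain: int = 4) -> R:
    """(i) split the domain, (ii) run `leaf` sequentially when small, (iii) merge."""
    if len(domain) <= max(grain, 1):      # "small" domain: execute sequentially
        return leaf(domain)
    mid = len(domain) // 2                # split the domain in two halves
    left = recursive_apply(domain[:mid], leaf, merge, grain)
    right = recursive_apply(domain[mid:], leaf, merge, grain)
    return merge(left, right)             # merge partial results on the way back up

data = list(range(1, 101))
total = recursive_apply(data, leaf=sum, merge=lambda a, b: a + b, grain=8)
print(total)
```

Because the recursion forms a tree, merges near the root happen rarely and combine large subdomains, which is where the less frequent communication at the tree root comes from.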

Page 25

Hybrid Computation

• Vector/SIMD instructions
• Different core types
• Accelerators

• Can significantly reduce energy per flop
• Require (now) different source code
• Easy to compile CUDA to multicore (downgrade); hard to compile general OpenMP code to GPU (upgrade)

Page 26

GPU as a Disruptive Technology

• Disruptive technology: “good enough” cheaper technology that replaces better, more expensive one, starting with the low-end and expanding upward (Christensen)
  – Kills better technology, before it can really replace it at the very high end
  – HPC is high-end
• GPU is a disruptive technology: it will either kill/swallow the CPU or be swallowed by it
• Probable long-term outcome: tightly coupled cores with homogeneous architecture but heterogeneous performance that are not normally used concurrently
• Warning: The arguments in favor of hybrid architectures have not changed in the last 30 years.

Page 27

Do You Trust Your Results?

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 28

Resilience

• Transient errors are more frequent:
  – More transistors
  – Smaller transistors
  – Lower voltage
  – More manufacturing variance
• Error detection is expensive (e.g., nVidia vs. Power 7)
• Checkpoint/restart, as currently done, does not scale
• Need new, more scalable error recovery algorithms
• Supercomputers built of low-cost commodity components may suffer from (too) high a rate of undetected errors.
  – Will need software error detection or fault-tolerant algorithms

Page 29

Plus ça change, moins c’est la même chose (the more it changes, the less it stays the same)

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 30

Bulk Synchronous

• Many parallel applications are written in a “bulk-synchronous style”: alternating stages of local computation and global communication
• Model implicitly assumes that all processes advance at the same compute speed
• Assumption breaks down for an increasingly large number of reasons
  – Black swan effect
  – OS jitter
  – Application jitter
  – HW jitter

Page 31

Jitter Causes

• Black swan effect
  – If each thread is unavailable (busy) for 1 msec once a month, then most collective communications involving 1B threads take > 1 msec
• OS jitter
  – Background OS activities (daemons, heartbeats…)
• HW jitter
  – Background error recovery activities (e.g., memory error correction, memory scrubbing, reexecution); power management; management of manufacturing variability; degraded operation modes
• Application jitter
  – Input-dependent variability in computation intensity
• Need to move away from bulk synchronous model
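
The black swan estimate can be reproduced with a back-of-the-envelope model. The sketch below assumes threads stall independently and at uniformly random times, and that a collective is delayed if any thread's stall overlaps the window during which the collective needs that thread. The 30-day month, the `window_seconds` parameter, and the function name are assumptions for illustration.

```python
import math

def prob_collective_delayed(n_threads: int,
                            busy_seconds: float,
                            period_seconds: float,
                            window_seconds: float = 0.0) -> float:
    """P(at least one of n_threads is stalled while the collective needs it)."""
    # Per-thread chance that its stall overlaps the collective's window.
    p = (busy_seconds + window_seconds) / period_seconds
    # 1 - (1 - p)**n, computed stably for tiny p and huge n.
    return -math.expm1(n_threads * math.log1p(-p))

MONTH = 30 * 24 * 3600  # seconds, assuming a 30-day month

instant = prob_collective_delayed(10**9, 1e-3, MONTH)
one_ms = prob_collective_delayed(10**9, 1e-3, MONTH, window_seconds=1e-3)
small = prob_collective_delayed(10**6, 1e-3, MONTH)
print(f"1B threads, instantaneous collective: {instant:.2f}")
print(f"1B threads, 1 msec collective window: {one_ms:.2f}")
print(f"1M threads, instantaneous collective: {small:.5f}")
```

Under these assumptions an event that is negligible for a million threads hits a large fraction of collectives at a billion threads, and the fraction exceeds one half once the collective itself spans about a millisecond.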

Page 32

Possible Approaches

• Eliminate unneeded synchronizations
  – Code compilation/refactoring for added asynchrony
    • Need better analysis tools to identify critical path (“read/post early; use late” may not work)
  – Dynamic scheduling (e.g., Dongarra’s latest LU codes)
  – Virtualization (e.g., Charm++)
• Reduce needed producer-consumer synchronizations or stretch them in time
  – Theory question: how do delayed updates affect convergence rate of iterative solvers?

Page 33

Task Virtualization

• Multiple logical tasks are scheduled on each physical core; tasks are scheduled nonpreemptively; task migration is supported
  – Hides variance and communication latency
  – Helps with scalability (decouples # tasks from # cores)
  – Helps with resiliency
  – Needed for modularity (multiphysics/multiscale codes – handling parallel coupling of modules)
  – Improves performance (better locality)
  – Scales (Charm++/AMPI)
  – Can be implemented below MPI or PGAS languages
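
A minimal sketch of non-preemptive scheduling of several logical tasks on one physical core, using Python generators as stand-ins for tasks that give up control voluntarily. It is not how Charm++ or AMPI is implemented; `logical_task` and `run_core` are hypothetical names, and task migration between cores is not modeled.

```python
from collections import deque

def logical_task(name: str, steps: int):
    """A logical task that yields control voluntarily after each step."""
    for step in range(steps):
        yield f"{name}: step {step}"

def run_core(tasks) -> list:
    """Round-robin, non-preemptive scheduler for one physical core."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))   # run until the task yields
            ready.append(task)         # still has work: back of the queue
        except StopIteration:
            pass                       # task finished; drop it
    return trace

trace = run_core([logical_task("A", 2), logical_task("B", 3)])
for line in trace:
    print(line)
```

Because the number of logical tasks is independent of the number of cores, the same program can be mapped onto more or fewer cores by changing only how the task list is divided among `run_core` instances.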

Page 34

Task Virtualization Styles

• Varying, user controlled number of tasks (Charm++)
  – Locality achieved by load balancer
• Implicit tasks: e.g., TBB
  – Locality is achieved implicitly

Page 35

On the Need for Culture Change

• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization

Page 36

Big Systems Are Expensive

• 1% performance gain on a 4 week run = $100,000. Are we willing to invest a man-year to get it?
• Would we have our undergraduate students implement a major experiment at CERN?
• Major supercomputing application codes should be developed by professional teams that include specialized engineers – including a performance engineer and a SW architect
  – Incentives should encourage this model
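
The $100,000 figure implies a cost for the run itself, which the sketch below works out. The engineer-year cost used for the break-even estimate is a purely hypothetical assumption, not a number from the slides; the function name is illustrative.

```python
import math

def implied_run_cost(gain_fraction: float, gain_value_usd: float) -> float:
    """Cost of the full run if `gain_fraction` of it is worth `gain_value_usd`."""
    return gain_value_usd / gain_fraction

run_cost = implied_run_cost(0.01, 100_000)   # 1% of the run is worth $100,000
hours = 4 * 7 * 24                           # a 4 week run
cost_per_hour = run_cost / hours

ENGINEER_YEAR_USD = 200_000                  # hypothetical assumption
runs_to_break_even = math.ceil(ENGINEER_YEAR_USD / 100_000)

print(f"implied cost of the run: ${run_cost:,.0f}")
print(f"implied cost per hour:   ${cost_per_hour:,.0f}")
print(f"4 week runs needed to repay one engineer-year: {runs_to_break_even}")
```

The slide's numbers put a 4 week run at about $10M, so even a 1% tuning gain pays back a dedicated performance engineer within a few runs under the assumed cost.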

Page 37

Good Engineers Need Good Tools

Need integrated development environments
• Expert friendly tools for good engineers – a steep learning curve is necessary (no easy way to learn brain surgery)
• Analysis, debugging and performance tools are fully integrated in development environment at all levels of code creation/refactoring
  – correctness/performance information is presented in terms of programmer’s interface
  – compiler analysis and performance information available for refactoring
• Support a systematic methodology for performance debugging
  – Requires a performance model
• Will not come from industry – no market – but can leverage industrial infrastructure
• Performance programming can be made easier, but will never be easy – we have not automated bridge building, either

Page 38

International Politics of Supercomputing

• An exascale system
  – Will cost ~$1B
  – Will consume 20-50MW
  – May use much less commodity technology than current supercomputers
  – May not have any military application
• Should supercomputing be done by international consortia?
