Critical Issues at Exascale for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov 2012. Jack Dongarra, University of Tennessee, USA
Performance Development in Top500

[Chart: TOP500 performance development, 1994–2020 (projected); log scale from 1 Gflop/s to 1 Eflop/s; curves for the #1 system (N=1) and the #500 system (N=500).]
Potential System Architecture

Systems | 2012 (Titan Computer) | 2022 | Difference, Today & 2022
System peak | 27 Pflop/s | 1 Eflop/s | O(100)
Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) |
System memory | 710 TB (38 GB × 18,688 nodes) | 32–64 PB | O(10)
Node performance | 1,452 GF/s (1,311 + 141) | 1.2 or 15 TF/s | O(10) – O(100)
Node memory BW | 232 GB/s (52 + 180) | 2–4 TB/s | O(1000)
Node concurrency | 16 CPU cores + 2,688 CUDA cores | O(1k) or O(10k) | O(100) – O(1000)
Total node interconnect BW | 8 GB/s | 200–400 GB/s | O(10)
System size (nodes) | 18,688 | O(100,000) or O(1M) | O(100) – O(1000)
Total concurrency | 50 M | O(billion) | O(1,000)
MTTI | unknown | O(<1 day) | O(10)
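For reference, the 50 Gflops/W target in the 2022 column is simply the exascale peak divided by the ~20 MW power budget:

1 Eflop/s / 20 MW = 10^18 flop/s / (2 × 10^7 W) = 5 × 10^10 flop/s per watt = 50 Gflops/W,

roughly a 25x improvement in energy efficiency over Titan's ~2 Gflops/W.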
Potential System Architecture with a cap of $200M and 20 MW

(The projected system parameters are the same as in the table above.)
Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain lower bounds on communication.
• Mixed-precision methods: 2x the speed of operations and 2x the speed for data movement (a sketch follows below).
• Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we cannot guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
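As an illustration of the mixed-precision point, here is a minimal sketch of iterative refinement in C: the matrix is factored once in single precision (the fast part), and a few cheap refinement steps carried out in double precision recover full accuracy. It assumes a LAPACKE installation for the single-precision factor/solve (LAPACKE_sgetrf / LAPACKE_sgetrs); the test matrix is a random, diagonally dominant system used only for illustration. (LAPACK's dsgesv packages the same idea as a single routine.)

```c
/* Mixed-precision iterative refinement: factor in single, refine in double.
 * Minimal sketch; assumes LAPACKE (e.g. from OpenBLAS or MKL) is available.
 * Compile: cc refine.c -llapacke -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <lapacke.h>

int main(void)
{
    const int n = 500;
    double *A  = malloc((size_t)n * n * sizeof *A);   /* double-precision A      */
    float  *As = malloc((size_t)n * n * sizeof *As);  /* single-precision copy   */
    double *b  = malloc(n * sizeof *b);
    double *x  = malloc(n * sizeof *x);
    double *r  = malloc(n * sizeof *r);
    float  *rs = malloc(n * sizeof *rs);
    lapack_int *ipiv = malloc(n * sizeof *ipiv);

    /* Illustration only: a random, diagonally dominant (well conditioned) matrix. */
    srand(42);
    for (int i = 0; i < n; i++) {
        b[i] = 1.0;
        for (int j = 0; j < n; j++)
            A[i * n + j] = (double)rand() / RAND_MAX;
        A[i * n + i] += n;                        /* make it diagonally dominant */
    }

    /* 1. Factor A = LU in SINGLE precision: roughly 2x the speed of the double
     *    factorization, and half the data movement. */
    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];
    LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, As, n, ipiv);

    /* 2. Initial solve, still in single precision. */
    for (int i = 0; i < n; i++) rs[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, rs, 1);
    for (int i = 0; i < n; i++) x[i] = (double)rs[i];

    /* 3. Iterative refinement: residual and update in DOUBLE precision;
     *    the correction solve reuses the cheap single-precision factors. */
    for (int it = 0; it < 5; it++) {
        double rnorm = 0.0;
        for (int i = 0; i < n; i++) {             /* r = b - A x  (double) */
            double s = b[i];
            for (int j = 0; j < n; j++) s -= A[i * n + j] * x[j];
            r[i] = s;
            rnorm = fmax(rnorm, fabs(s));
        }
        printf("iter %d  ||r||_inf = %.3e\n", it, rnorm);
        if (rnorm < 1e-12) break;
        for (int i = 0; i < n; i++) rs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, rs, 1); /* A d = r */
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];               /* x += d  */
    }
    free(A); free(As); free(b); free(x); free(r); free(rs); free(ipiv);
    return 0;
}
```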
Major Changes to Algorithms/Software

• Must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.
Fork-Join Parallelization of LU and QR

Parallelize the update:
• Easy, and done in any reasonable software.
• This is the 2/3 n³ term in the flop count.
• Can be done efficiently with LAPACK plus multithreaded BLAS (dgemm).

[Execution trace, cores vs. time: synchronization in LAPACK LU, i.e. fork-join, bulk-synchronous processing. A minimal sketch of this pattern follows below.]
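To make the trace concrete, here is a minimal sketch of the fork-join, bulk-synchronous structure in C with OpenMP. The panel_kernel and update_kernel functions are placeholders invented for illustration, not LAPACK routines; only the synchronization structure matters: each step forks a parallel trailing-matrix update and joins at an implicit barrier before the next panel may start.

```c
/* Fork-join (bulk-synchronous) structure of a blocked one-sided factorization.
 * Sketch only: panel_kernel()/update_kernel() are placeholders standing in for
 * the LAPACK panel factorization and the dgemm trailing-matrix update.
 * Compile: cc -fopenmp forkjoin.c
 */
#include <stdio.h>
#include <omp.h>

#define NT 8          /* number of tile rows/columns   */
#define TS 256        /* tile size (TS x TS doubles)   */

static double tiles[NT][NT][TS * TS];   /* matrix stored as NT x NT tiles */

/* Placeholder kernels: touch the tile so the loops do real (dummy) work. */
static void panel_kernel(double *t)  { for (int i = 0; i < TS * TS; i++) t[i] += 1.0; }
static void update_kernel(double *t) { for (int i = 0; i < TS * TS; i++) t[i] *= 0.5; }

int main(void)
{
    for (int k = 0; k < NT; k++) {
        /* Sequential panel step: little parallelism, all other cores idle. */
        panel_kernel(tiles[k][k]);

        /* Fork: trailing-matrix update, the O(n^3) dgemm-rich part. */
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (int i = k + 1; i < NT; i++)
            for (int j = k + 1; j < NT; j++)
                update_kernel(tiles[i][j]);
        /* Join: implicit barrier here; no core may start panel k+1 until
         * every update of step k has finished (bulk-synchronous). */
    }
    printf("done: %d fork-join steps\n", NT);
    return 0;
}
```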
Allowing for delayed update, out-of-order, asynchronous, dataflow execution removes these synchronization points.
PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures

• Objectives: high utilization of each core; scaling to a large number of cores; synchronization-reducing algorithms.
• Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout (an OpenMP-tasks sketch of this style follows below).
• Arbitrary DAGs with dynamic scheduling.

[Execution traces, cores vs. time: fork-join parallelism vs. DAG-scheduled parallelism.]
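QUARK is PLASMA's own runtime; as a rough stand-in, OpenMP task dependences (OpenMP 4.0+) express the same dataflow idea: one thread unrolls the loops and submits tasks, the runtime derives the DAG from what each task reads and writes, and tasks execute out of order as soon as their dependences are satisfied, with no barrier between steps. The tile kernels are again placeholders, and the dependence pattern is a simplified version of a real tile factorization.

```c
/* DAG / dataflow scheduling of the same tile algorithm, with OpenMP task
 * dependences standing in for QUARK's dynamic scheduler.  The runtime builds
 * the DAG from the in/out tiles of each task, so updates of step k may overlap
 * with work from later steps -- no fork-join barrier.
 * Compile: cc -fopenmp dag.c   (requires OpenMP 4.0+ for `depend`)
 */
#include <stdio.h>
#include <omp.h>

#define NT 8
#define TS 256

static double tiles[NT][NT][TS * TS];

/* Placeholder tile kernels (not real factorization math). */
static void panel_kernel(double *t)                    { for (int i = 0; i < TS * TS; i++) t[i] += 1.0; }
static void update_kernel(double *t, const double *p)  { for (int i = 0; i < TS * TS; i++) t[i] += 0.5 * p[i]; }

int main(void)
{
    #pragma omp parallel
    #pragma omp single          /* one thread unrolls the loops and submits tasks */
    {
        for (int k = 0; k < NT; k++) {
            /* Panel task: writes tile (k,k). */
            #pragma omp task depend(inout: tiles[k][k][0])
            panel_kernel(tiles[k][k]);

            /* Update tasks: each reads tile (k,k) and updates tile (i,j).
             * They become ready as soon as the panel task finishes; independent
             * tiles proceed concurrently and out of order. */
            for (int i = k + 1; i < NT; i++)
                for (int j = k + 1; j < NT; j++) {
                    #pragma omp task depend(in: tiles[k][k][0]) depend(inout: tiles[i][j][0])
                    update_kernel(tiles[i][j], tiles[k][k]);
                }
        }
    }   /* all tasks complete at the end of the parallel region */
    printf("executed DAG for %d panel steps\n", NT);
    return 0;
}
```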
Communication Avoiding QR Example

A. Pothen and P. Raghavan. "Distributed orthogonal factorization." In Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, volume II (Applications), pages 1610–1620, Pasadena, CA, Jan. 1988. ACM.

[Figure: the tall-skinny matrix is split into row domains D0–D3; each domain runs an independent Domain_Tile_QR producing a local factor R0–R3; the R factors are then merged pairwise up a binary reduction tree (R0,R1 → R0; R2,R3 → R2; R0,R2 → final R).]
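The reduction tree in the figure can be sketched directly: each domain factors its block independently, and the small n × n R factors are then merged pairwise by re-factoring stacked pairs, so in a distributed setting the number of messages drops to O(log P). Below is a minimal serial sketch in C using LAPACKE_dgeqrf for the local factorizations (assumes LAPACKE is available); only the R factor is tracked, and the tree's Q factors, which a real CA-QR keeps implicitly, are discarded.

```c
/* Communication-avoiding (tree) QR for a tall-skinny matrix, serial sketch.
 * Each of the P=4 domains factors its own block; the n x n R factors are then
 * merged pairwise up a binary tree by re-factoring stacked pairs [Ri; Rj].
 * Assumes LAPACKE is available.  Compile: cc tsqr.c -llapacke -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

#define P  4        /* number of domains D0..D3             */
#define MB 1000     /* rows per domain                       */
#define N  32       /* columns (tall and skinny: P*MB >> N)  */

/* QR-factor an m x N row-major block in place and copy its N x N upper
 * triangular R factor into r (Q is discarded in this sketch). */
static void local_qr(int m, double *a, double *r)
{
    double tau[N];
    LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, m, N, a, N, tau);
    memset(r, 0, (size_t)N * N * sizeof *r);
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            r[i * N + j] = a[i * N + j];
}

int main(void)
{
    static double A[P][MB * N];       /* the four row domains D0..D3 */
    static double R[P][N * N];        /* local R factors R0..R3      */
    double stacked[2 * N * N];

    srand(1);
    for (int d = 0; d < P; d++)
        for (int i = 0; i < MB * N; i++)
            A[d][i] = (double)rand() / RAND_MAX - 0.5;

    /* Leaf level: independent Domain_Tile_QR on each domain (no communication). */
    for (int d = 0; d < P; d++)
        local_qr(MB, A[d], R[d]);

    /* Reduction tree: merge (R0,R1)->R0 and (R2,R3)->R2, then (R0,R2)->R0.
     * Each merge factors a small 2N x N stacked matrix. */
    for (int stride = 1; stride < P; stride *= 2) {
        for (int d = 0; d + stride < P; d += 2 * stride) {
            memcpy(stacked,         R[d],          (size_t)N * N * sizeof(double));
            memcpy(stacked + N * N, R[d + stride], (size_t)N * N * sizeof(double));
            local_qr(2 * N, stacked, R[d]);
        }
    }

    /* R[0] now holds the R factor of the full (P*MB) x N matrix (up to signs). */
    printf("R(0,0) = %.6f, R(N-1,N-1) = %.6f\n",
           R[0][0], R[0][(N - 1) * N + (N - 1)]);
    return 0;
}
```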
PowerPack 2.0

The PowerPack platform consists of software and hardware instrumentation. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Power for QR Factorization

Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS; the matrix is very tall and skinny (m × n = 1,152,000 × 288).

[Power traces for four variants: PLASMA's communication-reducing QR (DAG based), PLASMA's conventional QR (DAG based), MKL's QR (fork-join based), and LAPACK's QR (fork-join based).]