1
Graph Sparsification Approaches to Scalable Integrated Circuit Modeling and Simulations
Zhuo Feng
ICSICT, Oct, 2014
Design Automation Group
Acknowledgements: My PhD students Xueqian Zhao (MTU) and Lengfei Han (MTU)
2
Scalable SPICE-Accurate IC Simulations
[Figure: Original circuit with analog and digital blocks. Analog circuit blocks (labels: Vin, Mp, Vref, Rf1, Rf2, Cf, Cout, Vout, Iout, If, VG, Error Amp, Cur. Amp.) and digital circuit blocks supplied through multiple VRs]
Motivation
– Integrated circuit (IC) systems that involve billions of transistors and interconnect components need to be accurately modeled and analyzed
Challenges in large-scale SPICE-accurate IC simulations
– Computational cost grows rapidly with traditional direct solution methods
– Iterative solution methods need to be robust and efficient for general tasks
Power Delivery Network (PDN) w/ Embedded Voltage Regulators (VRs)
3
Background of SPICE Simulation Algorithms
Problem formulation
– Nonlinear differential equations: $F(x) = f(x(t)) + \frac{d}{dt} q(x(t)) + u(t) = 0$
– f(.) and q(.) denote the static and dynamic nonlinearities, respectively
Standard SPICE simulators rely on the Newton-Raphson (NR) method
– Step 1: Linearize the nonlinear devices (transistors, diodes, etc.)
– Step 2: Update the solution through NR iteration
Jacobian of F(x): $G(x^k) = \left.\frac{\partial f}{\partial x}\right|_{x^k}, \quad C(x^k) = \left.\frac{\partial q}{\partial x}\right|_{x^k}$
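As a minimal sketch of the two NR steps, the code below applies a backward-Euler discretization of F(x) to a one-node circuit: a diode in parallel with a capacitor, driven by a current source. The diode model, all parameter values, and the function names are illustrative assumptions, not taken from the slides.

```python
import math

IS, VT, CAP = 1e-14, 0.02585, 1e-12      # hypothetical diode and capacitor values

def f(x):  return IS * (math.exp(x / VT) - 1.0)   # static nonlinearity (diode current)
def q(x):  return CAP * x                          # dynamic nonlinearity (capacitor charge)
def G(x):  return IS / VT * math.exp(x / VT)       # df/dx evaluated at x
def C(x):  return CAP                              # dq/dx evaluated at x

def nr_time_step(x_prev, u, h, tol=1e-12, max_iter=100):
    """Solve f(x) + (q(x) - q(x_prev)) / h + u = 0 for one backward-Euler step."""
    x = x_prev
    for _ in range(max_iter):
        residual = f(x) + (q(x) - q(x_prev)) / h + u
        jacobian = G(x) + C(x) / h       # Step 1: linearize around the current x
        dx = -residual / jacobian        # Step 2: NR update
        x += dx
        if abs(dx) < tol:
            return x
    raise RuntimeError("NR did not converge")

x = 0.0
for _ in range(200):                     # 200 time steps of 10 ps each
    x = nr_time_step(x, u=-1e-3, h=1e-11)
print(x)                                 # node voltage after the transient has settled
```

In a full simulator x is a vector and the scalar division becomes a sparse linear solve with the matrix G + C/h, which is the step the preconditioners in the following slides target.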
4
Prior Works
Direct and iterative solvers have been used in SPICE simulations
– Direct solver: LU decomposition (KLU [1])
– Expensive for large-scale post-layout IC problems because memory and runtime costs grow super-linearly with circuit size
– Krylov-subspace iterative methods: GMRES [2]
– Pros: black-box solver, good memory efficiency, high parallelism
– Cons: problem-dependent convergence properties, worse runtime
– ILU and domain-decomposition based preconditioners, etc.
Our contribution: a circuit-oriented preconditioning approach
– Novel circuit-oriented preconditioners (compared to matrix-oriented ones)
– Rigorous mathematical foundation: graph sparsification research [3-4]
– Consistent performance when solving transistor-level nonlinear circuits
References:
[1] T. Davis, et al. Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Trans. Math. Softw., 2010.
[2] Y. Saad, et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 1986.
[3] D. A. Spielman, et al. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. ACM STOC, 2004.
[4] M. Bern, et al. Support-graph preconditioners. SIAM J. Matrix Anal. Appl., 2006.
5
Graph Sparsification Techniques
Graph sparsification basics
– Find a subgraph P approximating the original graph G in some measure (pairwise distance, cut values, graph Laplacian, etc)
– Maintain the same set of vertices such that P can be used as a proxy for G in numerical computations w/o introducing much error
– A good graph sparsifier should keep very few edges to limit the computation and storage cost
[Figure: original graph G and its sparsifier P]
Figure source: I. Koutis, G. L. Miller and R. Peng. A fast solver for a class of linear systems. Commun. ACM, 2012
6
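Support-Graph Preconditioner
Support-graph preconditioner (SGP)
– Example: find a spanning tree from the original graph
– Compute matrix factors w/o introducing any fill-ins for the spanning tree
The condition number of $P^{-1}G$ can be greatly reduced
[Figure: original graph G, a 3x3 mesh with nodes 1-9 and weighted edges, next to its matrix]
$$G = \begin{bmatrix}
d_1 & 2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
2 & d_2 & 4 & 0 & 3 & 0 & 0 & 0 & 0 \\
0 & 4 & d_3 & 0 & 0 & 8 & 0 & 0 & 0 \\
1 & 0 & 0 & d_4 & 6 & 0 & 4 & 0 & 0 \\
0 & 3 & 0 & 6 & d_5 & 5 & 0 & 1 & 0 \\
0 & 0 & 8 & 0 & 5 & d_6 & 0 & 0 & 3 \\
0 & 0 & 0 & 4 & 0 & 0 & d_7 & 9 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 9 & d_8 & 4 \\
0 & 0 & 0 & 0 & 0 & 3 & 0 & 4 & d_9
\end{bmatrix}$$
[Figure: support graph P, a spanning tree of G on the same nodes 1-9, next to its matrix]
$$P = \begin{bmatrix}
d'_1 & 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & d'_2 & 4 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 4 & d'_3 & 0 & 0 & 8 & 0 & 0 & 0 \\
0 & 0 & 0 & d'_4 & 6 & 0 & 4 & 0 & 0 \\
0 & 0 & 0 & 6 & d'_5 & 5 & 0 & 0 & 0 \\
0 & 0 & 8 & 0 & 5 & d'_6 & 0 & 0 & 0 \\
0 & 0 & 0 & 4 & 0 & 0 & d'_7 & 9 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 9 & d'_8 & 4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 4 & d'_9
\end{bmatrix}$$
Matrix   1st     2nd     3rd     4th     5th    6th    cond
G        26.170  23.182  17.572  11.514  9.373  6.673  135.948
P        25.239  23.540  17.579  10.909  9.865  6.822  16.752
P^{-1}G  1.431   1.204   1.062   1.000   1.000  1.000  17.442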
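A spanning-tree support graph for this example can be sketched as follows. The slide only says "a spanning tree"; the code assumes a maximum-weight spanning tree found with Kruskal's algorithm (the choice used later in the Gilbert mixer case study), and the function name is hypothetical. The edge weights are the off-diagonal entries of the G matrix on this slide, and for this example the resulting tree has the same edges as the off-diagonal pattern of P.

```python
# Edge weights of the 3x3 mesh G (off-diagonal entries of the G matrix).
G_EDGES = {(1, 2): 2, (2, 3): 4, (1, 4): 1, (2, 5): 3, (3, 6): 8, (4, 5): 6,
           (5, 6): 5, (4, 7): 4, (5, 8): 1, (6, 9): 3, (7, 8): 9, (8, 9): 4}

def max_spanning_tree(edges):
    """Kruskal's algorithm with union-find, visiting edges from heaviest to lightest."""
    parent = {v: v for edge in edges for v in edge}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    tree = {}
    for (u, v), w in sorted(edges.items(), key=lambda item: -item[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # edge joins two components: keep it
            parent[root_u] = root_v
            tree[(u, v)] = w
    return tree

P_EDGES = max_spanning_tree(G_EDGES)
print(sorted(P_EDGES))
```

A tree with n nodes has n-1 edges and can be eliminated leaf by leaf, which is why its factorization introduces no fill-ins.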
7
Support-Circuit Preconditioner
A naïve support-circuit preconditioner (SCP)
– Sparsifies the linear networks of the original circuit network
– Takes advantage of existing sparse matrix techniques (Cholesky, LU, etc.)
– Nearly-linear complexity for analyzing nanoscale (parasitics-dominant) ICs
– E.g. clock networks, power delivery networks, etc.
[Figure: original network with VRs and digital circuit blocks; support-circuit preconditioner in which the network is replaced by the support graph of the original network]
8
Support-Circuit Preconditioner (Cont.)
General-purpose support-circuit preconditioner (GPSCP)
– Extracts sparsified network from the linearized circuit of the original circuit
– Leverages existing sparse matrix solution techniques
– Nearly-linear complexity for analyzing more general nonlinear circuit systems
[Figure: nonlinear circuit (transistor with terminals g, d, s and resistors R1-R5); linearized circuit (gm·Vgs, gds, Cds, Cgs, Cgd, g1-g5); support circuit (gm·Vgs, gds, Cds, Cgd, g1, g2, g3, g5)]
9
Support-Circuit Preconditioner Extraction (1)
Directed weighted graph corresponding to a linearized circuit
– Can be obtained around a solution point during NR iterations
– Will be sparsified through graph decomposition and sparsification
[Figure: extraction flow. Nonlinear circuit (R1-R5) -> (1) linearized circuit (gm·Vgs, gds, Cds, Cgs, Cgd, g1-g5) -> (2) directed weighted graph (gm·Vgs, gds, Cds/h, Cgs/h, Cgd/h, g1-g5) -> (3) undirected weighted graph (gds, Cds/h, Cgs/h, Cgd/h, g1-g5) -> (4) support graph (gds, Cds/h, Cgd/h, g1, g2, g3, g5)]
10
Support-Circuit Preconditioner Extraction (2)
Support-circuit preconditioner extraction
– Combine support graph and other components (e.g. controlling sources)
– Factor the Jacobian matrix of the support circuit to create the preconditioner
[Figure: extraction flow, continued. Support graph plus controlling sources (gm·Vgs) -> (5) support circuit -> (6) support circuits (Spt-CKT) -> (7) general-purpose support circuit]
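How a factored support matrix is used inside an iterative solver can be sketched on the 3x3 mesh example from the Support-Graph Preconditioner slide. Several things here are assumptions rather than the authors' setup: a small grounding conductance is added to every node so the matrices are nonsingular, the toy system is symmetric positive definite, and preconditioned conjugate gradients stand in for GMRES for that reason. The function names are hypothetical.

```python
import numpy as np

EDGES = {(1, 2): 2, (2, 3): 4, (1, 4): 1, (2, 5): 3, (3, 6): 8, (4, 5): 6,
         (5, 6): 5, (4, 7): 4, (5, 8): 1, (6, 9): 3, (7, 8): 9, (8, 9): 4}
TREE = [(1, 2), (2, 3), (3, 6), (4, 5), (5, 6), (4, 7), (7, 8), (8, 9)]

def conductance_matrix(edges, n=9, ground=0.1):
    """Weighted Laplacian plus an assumed grounding conductance on every node."""
    A = ground * np.eye(n)
    for (u, v), w in edges.items():
        A[u - 1, u - 1] += w
        A[v - 1, v - 1] += w
        A[u - 1, v - 1] -= w
        A[v - 1, u - 1] -= w
    return A

def pcg(A, b, apply_minv, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients; returns (solution, iterations used)."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_minv(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        z = apply_minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

G = conductance_matrix(EDGES)
P = conductance_matrix({e: EDGES[e] for e in TREE})
L = np.linalg.cholesky(P)                 # factor the support matrix once

def solve_with_support(r):
    """Apply P^{-1} through two triangular solves with the Cholesky factor."""
    return np.linalg.solve(L.T, np.linalg.solve(L, r))

b = np.arange(1.0, 10.0)
x_plain, its_plain = pcg(G, b, lambda r: r.copy())   # no preconditioner
x_prec, its_prec = pcg(G, b, solve_with_support)     # support-graph preconditioner
print(its_plain, its_prec)
```

For the nonsymmetric Jacobians of transistor-level circuits the same idea applies with an LU factorization of the support-circuit matrix and GMRES as the outer solver.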
11
Quality Quantification of Support Graph Preconditioners
Convergence of support-graph preconditioners
– The convergence relies on the condition number of matrix pencil (G,P): $\kappa(G,P) = \frac{\lambda_{\max}(G,P)}{\lambda_{\min}(G,P)}$
– The support of pencil (G,P) is defined as: $\sigma(G,P) = \min\{\tau \in \mathbb{R} \mid x^T(\tau P - G)x \ge 0 \text{ for all } x \in \mathbb{R}^n\}$
– Eigenvalues of pencil (G,P) are bounded by $\tau = \sigma(G,P)$
– A smaller $\tau$ means faster convergence
Spanning-tree support graph as a preconditioner
– May require many iterations to converge if $\tau$ (mismatch) is too large
– $\tau$ can be estimated by comparing Joule heating of two resistive networks
Power dissipated by G: $x^T G x$
Power dissipated by P: $x^T P x$
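These quantities can be evaluated numerically on the 3x3 mesh and spanning tree from the Support-Graph Preconditioner slide. This is a sketch under stated assumptions: a grounding conductance of 0.1 is added to every node so that both matrices are invertible, so the numbers are not expected to reproduce the table on that slide. The helper name is hypothetical.

```python
import numpy as np

EDGES = {(1, 2): 2, (2, 3): 4, (1, 4): 1, (2, 5): 3, (3, 6): 8, (4, 5): 6,
         (5, 6): 5, (4, 7): 4, (5, 8): 1, (6, 9): 3, (7, 8): 9, (8, 9): 4}
TREE = [(1, 2), (2, 3), (3, 6), (4, 5), (5, 6), (4, 7), (7, 8), (8, 9)]

def conductance_matrix(edges, n=9, ground=0.1):
    """Weighted Laplacian plus an assumed grounding conductance on every node."""
    A = ground * np.eye(n)
    for (u, v), w in edges.items():
        A[u - 1, u - 1] += w
        A[v - 1, v - 1] += w
        A[u - 1, v - 1] -= w
        A[v - 1, u - 1] -= w
    return A

G = conductance_matrix(EDGES)
P = conductance_matrix({e: EDGES[e] for e in TREE})

# Eigenvalues of the pencil (G, P) via the symmetric form L^{-1} G L^{-T}, with P = L L^T.
L_inv = np.linalg.inv(np.linalg.cholesky(P))
lam = np.linalg.eigvalsh(L_inv @ G @ L_inv.T)    # ascending order
kappa = lam[-1] / lam[0]                          # condition number of the pencil
sigma = lam[-1]                                   # support: smallest tau with tau*P - G >= 0

# Joule-heating comparison for one arbitrary node-voltage vector x.
x = np.random.default_rng(0).standard_normal(9)
heat_ratio = (x @ G @ x) / (x @ P @ x)            # lies between lam[0] and lam[-1]
print(kappa, sigma, heat_ratio)
```

Because P is a subgraph of G, the power dissipated by G is never smaller than that dissipated by P for the same voltages, so every pencil eigenvalue is at least 1 and the heating ratio gives a lower estimate of the support.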
12
Ultra-Sparsifier Support Graph (1)
Ultra-sparsifier (non-tree) support graphs
– Ultra-sparsifier contains at most n-1+k edges (spanning tree + extra edges)
– It is k-ultra-sparse and approximates the original graph with high probability [1]
– Adding extra edges to the spanning tree can better approximate the original graph (e.g. eigenvalues, power dissipations)
[Figure: spanning tree (edges of spanning tree graph) and ultra-sparsifier (spanning-tree edges plus extra edges)]
[1] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proc. ACM STOC, 2004.
13
Ultra-Sparsifier Support Graph (2)
Sparsity control of an ultra-sparsifier support graph
– Provides tradeoffs between the quality and efficiency of preconditioners
– Weighted degree of a vertex v in a graph A is defined as: $wd(v) = \frac{vol(v)}{\max_{u \in neighbor(v)} w(u,v)}$
– vol(v): total weight incident to node v
– w(u,v): the weight of the edge connecting nodes v and u
– Example: for a 2D-mesh grid, 1 ≤ wd(v) ≤ 4
– If wd(v) -> 1: one dominant edge
– If wd(v) -> 4: four evenly critical edges
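The definition translates directly into code. The sketch below assumes the graph is given as a dictionary from node pairs to positive edge weights; the function name and the two test graphs are illustrative only.

```python
def weighted_degrees(edges):
    """Return wd(v) = vol(v) / max_u w(u, v) for every node of a weighted graph."""
    vol, heaviest = {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            vol[node] = vol.get(node, 0.0) + w               # total incident weight
            heaviest[node] = max(heaviest.get(node, 0.0), w)  # heaviest incident edge
    return {node: vol[node] / heaviest[node] for node in vol}

# Centre node "c" of a 2D mesh with four neighbours (hypothetical weights).
even = weighted_degrees({("c", "n"): 1.0, ("c", "s"): 1.0, ("c", "e"): 1.0, ("c", "w"): 1.0})
dominant = weighted_degrees({("c", "n"): 1000.0, ("c", "s"): 1.0, ("c", "e"): 1.0, ("c", "w"): 1.0})
print(even["c"], dominant["c"])   # four evenly critical edges vs. one dominant edge
```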
14
Ultra-Sparsifier Support Graph (3)
Iterative ultra-sparsifier support graph construction
– Define θ as the matching factor threshold (0 < θ < 1) of node weighted degree
Step 1
• Compute weighted degree wd of each node in the original graph A
Step 2
• Compute the support graph A’ with weighted degree wd’
Step 3
• Recover edges to A’ until wd’/wd > θ for each node in the support graph A’
Step 4
• Return the final ultra-sparsifier support graph A’ for support-circuit preconditioning
[Figure: spanning tree with nodes where wd’/wd < θ, and the ultra-sparsifier after adding extra edges so that wd’/wd > θ]
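Steps 1 to 4 can be sketched on the 3x3 mesh and spanning tree from the Support-Graph Preconditioner slide. The slides do not say which edge is recovered first; this sketch assumes that, for the first node violating the threshold, the heaviest missing incident edge is restored. Function names and the value θ = 0.7 are hypothetical.

```python
G_EDGES = {(1, 2): 2, (2, 3): 4, (1, 4): 1, (2, 5): 3, (3, 6): 8, (4, 5): 6,
           (5, 6): 5, (4, 7): 4, (5, 8): 1, (6, 9): 3, (7, 8): 9, (8, 9): 4}
TREE = [(1, 2), (2, 3), (3, 6), (4, 5), (5, 6), (4, 7), (7, 8), (8, 9)]

def weighted_degrees(edges):
    """wd(v) = vol(v) / (heaviest edge weight incident to v)."""
    vol, heaviest = {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            vol[node] = vol.get(node, 0.0) + w
            heaviest[node] = max(heaviest.get(node, 0.0), w)
    return {node: vol[node] / heaviest[node] for node in vol}

def ultra_sparsifier(graph, tree, theta):
    """Recover off-tree edges until wd'(v) / wd(v) > theta holds for every node."""
    wd = weighted_degrees(graph)                       # Step 1
    support = {e: graph[e] for e in tree}              # Step 2: start from the tree
    while True:                                        # Step 3
        wd_s = weighted_degrees(support)
        bad = [v for v in sorted(wd) if wd_s[v] / wd[v] <= theta]
        if not bad:
            return support                             # Step 4
        v = bad[0]
        # Assumed recovery rule: heaviest missing edge incident to the violating node.
        missing = [(w, e) for e, w in graph.items() if v in e and e not in support]
        w, e = max(missing)
        support[e] = w

support = ultra_sparsifier(G_EDGES, TREE, theta=0.7)
print(sorted(set(support) - set(TREE)))   # extra edges added on top of the tree
```

The loop always terminates: a violating node has at least one missing edge, and once all edges are restored the ratio equals 1, which exceeds any θ < 1.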
15
Performance Model Guided Sparsification
Runtime performance model can help find the optimal θ
– Which is better: a denser or sparser support graph?
Total runtime: $T_{tot} = N \cdot T_{GMRES} + T_{LU}$
Denser preconditioner
1. Greater LU factorization time
2. Fewer GMRES iterations
Sparser preconditioner
1. Less LU factorization time
2. More GMRES iterations
Goal: minimize $T_{tot}$ by finding a proper matching factor threshold θ!
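The tradeoff can be illustrated with a small selection routine over candidate thresholds. All numbers below are made up for illustration; in practice N, T_GMRES and T_LU would come from measurements or from a performance model.

```python
def total_runtime(n_iters, t_gmres, t_lu):
    """T_tot = N * T_GMRES + T_LU."""
    return n_iters * t_gmres + t_lu

# Hypothetical data per threshold: (GMRES iterations N, seconds per iteration, LU seconds).
candidates = {
    0.2: (400, 0.05, 1.0),    # sparser: cheap LU, many iterations
    0.5: (120, 0.06, 4.0),
    0.8: (60, 0.08, 15.0),    # denser: expensive LU, few iterations
}

best_theta = min(candidates, key=lambda theta: total_runtime(*candidates[theta]))
print(best_theta, total_runtime(*candidates[best_theta]))
```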
16
Finding the Optimal Weighted Degree Threshold θ
Optimal weighted degree threshold θ
– Exploit symbolic matrix factorization results to quickly identify optimal θ
– E.g. find θ that maximizes the flops change of Cholesky factorizations
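Symbolic factorization estimates fill-in and flops from the sparsity pattern alone, so candidate support graphs can be compared without any numerical work. The sketch below is an assumption-laden illustration, not the authors' tool: it plays the elimination game with a greedy minimum-degree ordering and uses the squared pivot degree as a rough flop count. The three edge lists reuse the 3x3 mesh example; the "ULTRA" candidate is hypothetical.

```python
def symbolic_cholesky_cost(n, edges):
    """Elimination game with greedy minimum-degree pivots.
    Returns (number of fill-in edges, rough flop estimate); no numerical values are used."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    fill, flops = 0, 0
    while adj:
        pivot = min(adj, key=lambda node: (len(adj[node]), node))
        nbrs = adj.pop(pivot)
        flops += len(nbrs) ** 2              # rough cost of the pivot's rank-1 update
        for a in nbrs:
            adj[a].discard(pivot)
        for a in nbrs:                        # neighbours of the pivot become a clique
            for b in nbrs:
                if a < b and b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill, flops

MESH = [(1, 2), (2, 3), (1, 4), (2, 5), (3, 6), (4, 5),
        (5, 6), (4, 7), (5, 8), (6, 9), (7, 8), (8, 9)]
TREE = [(1, 2), (2, 3), (3, 6), (4, 5), (5, 6), (4, 7), (7, 8), (8, 9)]
ULTRA = TREE + [(1, 4), (2, 5), (6, 9)]       # hypothetical tree-plus-extra-edges candidate

for name, edges in (("tree", TREE), ("ultra", ULTRA), ("mesh", MESH)):
    print(name, symbolic_cholesky_cost(9, edges))
```

Sweeping θ and recording how the flop estimate changes between consecutive candidates is one way to locate the threshold at which the factorization cost starts to rise sharply.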
17
Performance Modeling Results
Experimental results for IBM power grid benchmarks
Runtime and flops vs. weighted degree threshold θ
Runtime results of manual and automatic sparsification schemes
18
Test Cases for Experiments
CKT   #nunk  #Mos  #R   #C    #L   #I
ldo1  3M     84K   6M   250K  7K   250K
ldo2  5M     71K   10M  422K  12K  422K
pg1   3M     144   6M   250K  7K   250K
pg2   6M     144   11M  490K  14K  490K
clk1  3M     65K   6M   3M    -    -
clk2  6M     65K   11M  6M    -    -
Circuit design parameters:
• #nunk: number of unknowns in the circuits
• #Mos: number of MOSFETs
• #R: number of resistors
• #C: number of capacitors
• #L: number of inductors
• #I: number of current sources
Three circuit design types:
• ldo: large PDNs with on-chip VRs
• pg: large PDNs with power gating
• clk: clock distribution network
19
Results of Performance Model Guided Sparsification
Experimental results for a large PDN with multiple VRs
– The performance-guided sparsification approach achieves nearly optimal runtime
Runtime of a single NR step using different θ
20
Experimental Results
• Runtime comparison for transient analysis (100 time steps)
CKT   #NR  Direct Time (s)  GPSCP #GMRES  GPSCP Time (s)  Speedup
ldo1  237  279,629          4,130         15,368          18X
ldo2  314  -                3,979         23,793          -
pg1   222  108,784          3,381         10,204          11X
pg2   421  185,892          3,478         14,206          13X
clk1  132  50,688           1,452         3,493           14X
clk2  219  112,497          2,555         8,001           14X
• Memory comparison
CKT   Direct  GPSCP  Reduction
ldo1  4.2GB   0.8GB  5X
ldo2  -       1.1GB  -
pg1   3.2GB   0.8GB  4X
pg2   7.8GB   1.6GB  5X
clk1  4.3GB   0.8GB  5X
clk2  10.0GB  1.4GB  7X
21
Experimental Results (2)
A large PDN with embedded multiple VRs
22
RF Simulation Methods
For nonlinear RF circuits, output is usually quasi-periodic
– SPICE may require simulating many periods to reach steady state
– Time-domain shooting method cannot handle distributed devices
Harmonic Balance (HB) analysis for steady-state RF simulation
– HB analysis can capture the steady-state spectral response directly
– Harmonic balance also refers to balancing the current between linear and nonlinear portions at every harmonic frequency
[Figure: a nonlinear circuit driven by $\cos(\omega_0 t)$ with output voltage v; the output may contain frequencies other than $\omega_0$. Plots: voltage (V) vs. time (ps) and magnitude (dB) vs. frequency (MHz)]
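The appearance of new frequencies can be checked numerically by passing one period of cos(ω0 t) through a memoryless nonlinearity and taking an FFT. The polynomial nonlinearity below is a made-up stand-in for a real circuit, and the variable names are hypothetical.

```python
import numpy as np

k = 4                          # number of positive harmonics kept
s = 2 * k + 1                  # time samples over one period
t = np.arange(s) / s           # normalized time (one period of the fundamental)
v_in = np.cos(2 * np.pi * t)

# Hypothetical memoryless nonlinearity (not from the slides).
v_out = v_in + 0.5 * v_in**2 + 0.2 * v_in**3

spectrum = np.fft.rfft(v_out) / s
amplitude = np.abs(spectrum)
amplitude[1:] *= 2             # one-sided amplitudes for harmonics 1..k
print(np.round(amplitude, 6))  # DC, fundamental, 2nd, 3rd, 4th harmonic
```

The quadratic term contributes a DC offset and a second harmonic, and the cubic term adds to the fundamental and creates a third harmonic, even though the input contains only ω0.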
23
HB Analysis of RF Circuits
Non-autonomous circuit analysis [1]
– $x(t)$: state variables
– $y(t)$: impulse response function of linear circuit components
– $q(\cdot)$: dynamic nonlinearities
– $f(\cdot)$: static nonlinearities
– $b(t)$: time-dependent excitation sources
– $x(t)$, $q(\cdot)$, $f(\cdot)$ are typically periodic functions
[1] K. S. Kundert and A. Sangiovanni-Vincentelli. Simulation of Nonlinear Circuits in the Frequency Domain. IEEE Trans. CAD, 1986.
24
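HB Analysis of RF Circuits (2)
HB Jacobian matrix (frequency domain): $J_{hb} = Y + j 2\pi f_0 \Omega \Gamma C \Gamma^{-1} + \Gamma G \Gamma^{-1}$
– $\Gamma$ and $\Gamma^{-1}$ represent the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT), respectively
– G and C denote the linearizations of f() and q() at s time-domain sampled points (s = 2k+1, where k is the number of positive frequencies)
$$C = \mathrm{diag}\left(\left.\frac{\partial q}{\partial x}\right|_{t_1}, \left.\frac{\partial q}{\partial x}\right|_{t_2}, \ldots, \left.\frac{\partial q}{\partial x}\right|_{t_s}\right), \quad
G = \mathrm{diag}\left(\left.\frac{\partial f}{\partial x}\right|_{t_1}, \left.\frac{\partial f}{\partial x}\right|_{t_2}, \ldots, \left.\frac{\partial f}{\partial x}\right|_{t_s}\right)$$
$$\Omega = \mathrm{diag}(-kI, \ldots, 0, \ldots, kI)$$
– $J_{hb}$ includes lots of dense blocks introduced by $\Gamma C \Gamma^{-1}$ and $\Gamma G \Gamma^{-1}$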
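Why these products are dense can be seen for a single circuit variable: a diagonal matrix of time-domain samples becomes a full circulant matrix between the FFT and IFFT operators. The sketch uses NumPy's unnormalized DFT convention and made-up sample values.

```python
import numpy as np

s = 5
g = np.array([1.0, 2.0, 0.5, 3.0, 1.5])    # hypothetical samples g(t_1), ..., g(t_s)

Gamma = np.fft.fft(np.eye(s))               # DFT matrix (FFT applied to each unit vector)
Gamma_inv = np.fft.ifft(np.eye(s))          # inverse DFT matrix

block = Gamma @ np.diag(g) @ Gamma_inv      # frequency-domain image of diag(g)
first_column = np.fft.fft(g) / s            # Fourier coefficients of the sample sequence

# Every entry depends only on (row - column) mod s, i.e. the block is circulant.
is_circulant = all(np.isclose(block[(m + 1) % s, (n + 1) % s], block[m, n])
                   for m in range(s) for n in range(s))
print(is_circulant, np.count_nonzero(np.abs(block) > 1e-12))
```

A time-invariant element (all samples equal) gives a diagonal block; the more a device conductance varies over the period, the more off-diagonal harmonics are coupled.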
25
Challenges in Harmonic Balance (HB) Analysis
Direct methods for RF HB circuit simulation (A. Mehrotra et al, DAC’09)
– Challenged by solving large yet non-sparse Jacobian matrices
– Cons: computation/memory cost grows quickly with circuit size
Traditional iterative methods for HB analysis (P. Feldmann et al, CICC’96, W. Dong et al, TCAD’09)
– Pros: black-box, matrix-oriented, memory-efficient
– E.g. ILU preconditioner, domain-decomposition preconditioner
– Cons: inefficient/unreliable for strongly nonlinear RF systems
Dense circulant matrices due to FFT/IFFT operations:
$$\Gamma \cdot G \cdot \Gamma^{-1} = \begin{bmatrix}
G_1 & G_s & \cdots & G_2 \\
G_2 & G_1 & \cdots & G_3 \\
\vdots & \vdots & \ddots & \vdots \\
G_s & G_{s-1} & \cdots & G_1
\end{bmatrix}, \quad
G = \mathrm{diag}(g_1, g_2, \ldots, g_s), \quad
[G_1, G_2, \ldots, G_s]^T = \mathrm{FFT}\left([g_1, g_2, \ldots, g_s]^T\right)$$
26
Graph Sparsification Approach to HB Analysis
From graph sparsification to Jacobian matrix sparsification
– Modified nodal analysis (MNA) matrix reduction: 20% ~ 38% fewer entries
– Fill-ins during LU: 60% reduction; LU factorization speedup: 50X
[Figure: nonzero patterns of the MNA matrix and the HB Jacobian matrix, with fill-ins and block fill-ins during LU (marked ×), before and after graph sparsification]
27
Conclusion
Graph sparsification approaches to circuit simulations
– MNA matrix decomposition into Laplacian and complement matrices
– Performance-guided graph sparsification of the Laplacian matrix
– Support-circuit preconditioner construction
Our preliminary results
– Highly reliable convergence for time/frequency domain simulations
– Up to 18X (21X) speedup and 7X (6X) memory reduction for time (frequency) domain simulations
– Scalable to large post-layout integrated circuits
Future work
– Will explore spectral graph sparsification methods
– Will exploit heterogeneous CPU-GPU computing platforms
28
Nonlinear Devices Evaluation in HB
Evaluation of nonlinear devices
– Freq->Time: terminal voltage waveforms
– Time domain: evaluate current (derivative) waveforms
– Time->Freq: currents (derivatives) in freq. domain
[Figure: terminal voltage spectrum -> IFFT/IAPDFT -> terminal voltage samples -> device evaluation -> Ids samples -> FFT/APDFT (Almost-Periodic DFT) -> Ids spectrum]
Terminal voltage samples
– Need sampling at 2k+1 time points (k is the number of positive frequencies) according to the Nyquist–Shannon sampling theorem.
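The three stages can be sketched for a single-tone case with NumPy's real FFT. The square-law device model and all parameter values are assumptions chosen so the result can be checked by hand; almost-periodic (multi-tone) transforms are not covered here.

```python
import numpy as np

k = 3                                   # number of positive frequencies
s = 2 * k + 1                           # 2k+1 time samples
V_BIAS, AMP = 1.5, 0.5                  # hypothetical gate bias and drive amplitude (V)
V_TH, K_N = 0.5, 1.0                    # hypothetical threshold (V) and gain factor (A/V^2)

# Terminal-voltage spectrum, scaled the way numpy's rfft output is scaled.
V = np.zeros(k + 1, dtype=complex)
V[0] = V_BIAS * s                       # DC component
V[1] = AMP / 2 * s                      # fundamental: AMP * cos(w0 * t)

vgs = np.fft.irfft(V, n=s)              # Freq -> Time: terminal voltage samples
ids = K_N * (vgs - V_TH) ** 2           # Time domain: device evaluation at each sample
Ids = np.fft.rfft(ids) / s              # Time -> Freq: Ids spectrum (coefficients 0..k)
print(np.round(Ids.real, 6))
```

With these values Ids = (1 + 0.5 cos ω0t)², so the spectrum has a DC term of 1.125, a fundamental coefficient of 0.5 and a second-harmonic coefficient of 0.0625; k must be large enough to hold the harmonics the device generates, otherwise they alias.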
29
Support-Circuit Preconditioner for HB Analysis
Step 1: MNA matrix decomposition of linearized RF circuit
– Laplacian matrix (P): passive devices such as resistors, capacitors, etc.
– Complement matrix (A): active devices such as transconductances, etc.
[Figure: RF circuit (M1, L1, L2, R1, R2, C1, C2) linearized at each sampled time point t1, ..., ts (adding Cgs, Cgd, gds, gm·Vgs, nodes 1-5); each linearized circuit is split into a Laplacian matrix P(ti) and a complement matrix A(ti)]
t1~ts are the s sampled time points
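The split can be sketched with element stamps: two-terminal elements go into a symmetric Laplacian-like matrix P, and controlled sources go into the complement matrix A. The three-node transistor stage, its element values, the time step h and the function names are all hypothetical.

```python
import numpy as np

def stamp_conductance(M, a, b, g):
    """Stamp a two-terminal conductance g between nodes a and b (node 0 is ground)."""
    for i, j, val in ((a, a, g), (b, b, g), (a, b, -g), (b, a, -g)):
        if i and j:
            M[i - 1, j - 1] += val

def stamp_vccs(M, out_p, out_n, ctrl_p, ctrl_n, gm):
    """Stamp a current gm * (V[ctrl_p] - V[ctrl_n]) flowing from out_p to out_n."""
    for i, j, val in ((out_p, ctrl_p, gm), (out_p, ctrl_n, -gm),
                      (out_n, ctrl_p, -gm), (out_n, ctrl_n, gm)):
        if i and j:
            M[i - 1, j - 1] += val

n = 3                      # nodes: 1 = gate, 2 = drain, 3 = source
h = 1e-12                  # hypothetical time step for the C/h companion conductances
P = np.zeros((n, n))       # Laplacian part: passive elements
A = np.zeros((n, n))       # complement part: active elements

stamp_conductance(P, 2, 3, 1e-4)          # gds
stamp_conductance(P, 1, 3, 2e-15 / h)     # Cgs / h
stamp_conductance(P, 1, 2, 5e-16 / h)     # Cgd / h
stamp_conductance(P, 2, 0, 1e-3)          # load resistor to ground
stamp_vccs(A, 2, 3, 1, 3, 2e-3)           # gm * Vgs, flowing drain -> source

M = P + A                                  # full MNA matrix at this time point
print(np.allclose(P, P.T), np.allclose(A, A.T))
```

Only P corresponds to an undirected weighted graph, which is why the sparsification in the following steps operates on P and then re-attaches A.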
30
Support-Circuit Preconditioner for HB Analysis (2)
Step 2: Representative Laplacian matrix construction
– Different sampled time points have different entry values
– Normalize the scaled Laplacian matrices of all sampled time points
[Figure: P(t1), P(t2), …, P(ts) are normalized and averaged into the representative Laplacian matrix]
31
Support-Circuit Preconditioner for HB Analysis (3)
Step 3: Sparsification Pattern Extraction
– Convert matrix to weighted graph
– Sparsify the weighted graph and convert back to matrix form
– Combine with the complement matrix
[Figure: representative Laplacian matrix -> original weighted graph (nodes 1-5; edge weights g1+C2/h, gds+Cds/h, C1/h, Cgd/h, Cgs/h, g2) -> ultra-sparsifier (Cgs/h edge removed) -> sparsified representative Laplacian matrix, combined with the complement matrix into the sparsification pattern matrix]
32
Support-Circuit Preconditioner for HB Analysis (4)
Step 4: MNA Matrix Sparsification
[Figure: the sparsification pattern matrix is applied to the system MNA matrices at t1, t2, …, ts to obtain the sparsified system MNA matrices at t1, t2, …, ts]
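Steps 2 to 4 can be sketched end to end on toy data. Several choices here are assumptions because the slides do not spell them out: each Laplacian is normalized by its largest-magnitude entry before averaging, the kept edges are picked by hand instead of by an ultra-sparsifier, and entries outside the pattern are simply zeroed without compensating the diagonal. All names and values are hypothetical.

```python
import numpy as np

def representative(laplacians):
    """Scale each P(t_i) to unit largest magnitude (assumed normalization), then average."""
    return sum(P / np.abs(P).max() for P in laplacians) / len(laplacians)

def pattern_matrix(n, kept_edges, complement):
    """Boolean mask: diagonal, edges kept by the sparsifier, complement-matrix nonzeros."""
    keep = np.eye(n, dtype=bool) | (complement != 0)
    for u, v in kept_edges:
        keep[u, v] = keep[v, u] = True
    return keep

def sparsify(mna, keep):
    """Zero every entry outside the pattern (no diagonal compensation in this sketch)."""
    return np.where(keep, mna, 0.0)

# Hypothetical 3-node Laplacians at two sampled time points, plus one active (gm) entry.
P_t = [np.array([[3.0, -1.0, -2.0], [-1.0, 1.5, -0.5], [-2.0, -0.5, 2.5]]),
       np.array([[6.0, -4.0, -2.0], [-4.0, 4.2, -0.2], [-2.0, -0.2, 2.2]])]
A = np.array([[0.0, 0.0, 0.0], [0.7, 0.0, 0.0], [0.0, 0.0, 0.0]])

P_rep = representative(P_t)                                   # Step 2
keep = pattern_matrix(3, kept_edges=[(0, 1), (0, 2)], complement=A)   # Step 3, edge (1, 2) dropped
sparsified = [sparsify(P + A, keep) for P in P_t]             # Step 4: same pattern for every t_i
print(keep.astype(int))
```

Using one shared pattern for all time points is what keeps the block structure of the HB Jacobian regular after sparsification.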
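33
Support-Circuit Preconditioner for HB Analysis (5)
Step 5: Support circuit block preconditioner generation
– Original matrix: all variables of a single harmonic grouped together
– Permuted matrix: all the harmonics of a single variable grouped together
Circulant matrix in HB:
$$\Gamma \cdot G \cdot \Gamma^{-1} = \begin{bmatrix}
G_1 & G_s & \cdots & G_2 \\
G_2 & G_1 & \cdots & G_3 \\
\vdots & \vdots & \ddots & \vdots \\
G_s & G_{s-1} & \cdots & G_1
\end{bmatrix}, \quad
G = \mathrm{diag}(g_1, g_2, \ldots, g_s), \quad
[G_1, G_2, \ldots, G_s]^T = \mathrm{FFT}\left([g_1, g_2, \ldots, g_s]^T\right)$$
[Figure: sparsified MNA matrix -> FFT -> permutation -> permuted matrix used as the support circuit preconditioner]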
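The regrouping from harmonic-major to variable-major ordering is an index permutation. The sketch below assumes unknowns are stored as [harmonic][variable] and uses a Kronecker product as a stand-in for the HB Jacobian structure; the function name and sizes are hypothetical.

```python
import numpy as np

def harmonic_to_variable_major(n_vars, n_harm):
    """Permutation p such that x[p] regroups unknowns stored as
    [harmonic][variable] into [variable][harmonic]."""
    idx = np.arange(n_vars * n_harm).reshape(n_harm, n_vars)
    return idx.T.ravel()

n_vars, n_harm = 2, 3
p = harmonic_to_variable_major(n_vars, n_harm)
print(p)                                           # [0 2 4 1 3 5]

rng = np.random.default_rng(1)
coupling = rng.standard_normal((n_harm, n_harm))   # harmonic-to-harmonic coupling (circulant-like)
pattern = rng.standard_normal((n_vars, n_vars))    # variable-to-variable (MNA) pattern

original = np.kron(coupling, pattern)              # variables of one harmonic grouped together
permuted = original[np.ix_(p, p)]                  # harmonics of one variable grouped together
```

After the permutation every nonzero of the sparsified MNA pattern becomes one dense n_harm x n_harm block, which is the form a block LU factorization works on.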
34
Case Study: Double-balanced Gilbert Mixer
MOSFET linearization model
Linearized passive network (Laplacian matrix) extraction
[Figure: double-balanced Gilbert mixer schematic (M1-M6, R1-R8, R10, L0-L3, C0, C1, VDD, Vlo+, Vlo-, Vrf+, Vrf-) annotated with node indices; MOSFET linearization model with terminals D, G, S, B and elements Rds, gm·Vgs, gn·Vbs, Cgd, Cgs; extracted linearized passive network on the same node indices]
[xx] denotes node index
35
Case Study: Double-balanced Gilbert Mixer (cont.)
Ultra-sparsifier support graph construction
– Step 1: Extract maximum spanning tree
– Step 2: Restore critical edges until reaching a desired approximation
[Figure: Laplacian graph, maximum spanning tree, and ultra-sparsifier, each drawn on nodes 1, 2, 4, 6, 8, 11, 13, 14, 16, 17, 18, 21, 22, 25, 27]
36
HB Simulation Engine on CPU-GPU Platform
[Flowchart: Start -> device evaluation -> support-circuit preconditioner -> preconditioner factorization -> GMRES iterations -> convergence checking -> End, inside the NR loop]
Support-circuit preconditioner
– Decompose MNA matrix into passive and active matrices
1. Performance modeling based sparsification configuration
2. Construct representative passive matrix
3. Extract sparsification pattern
4. Sparsify MNA matrix
5. Generate support-circuit preconditioner
Preconditioner factorization
– GPU-based block LU decomposition
GMRES iterations
– Matrix-free iterative solver
37
Runtime Performance Modeling
Lookup table (LUT) for runtime performance modeling
– 2D LUTs predict LU factorization runtime on GPU
– Two LUTs are created for GPU matrix multiplications and matrix divisions
[Figure: runtime performance lookup table for GPU-based matrix operations, indexed by matrix size and matrix operation batch size, with bilinear interpolation between table entries]
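A 2D lookup with bilinear interpolation can be sketched as below. The axis points and timings are invented placeholders, not measured GPU data, and the function names are hypothetical.

```python
from bisect import bisect_right

SIZES = [8, 16, 32, 64]               # matrix size axis (hypothetical grid)
BATCHES = [100, 1000, 10000]          # batch size axis (hypothetical grid)
TIME_MS = [[0.1, 0.4, 3.0],           # made-up timings: TIME_MS[i][j] for SIZES[i], BATCHES[j]
           [0.2, 1.0, 8.0],
           [0.8, 5.0, 40.0],
           [4.0, 30.0, 250.0]]

def _bracket(axis, value):
    """Index of the lower grid point and the fractional position inside that cell."""
    i = min(max(bisect_right(axis, value) - 1, 0), len(axis) - 2)
    return i, (value - axis[i]) / (axis[i + 1] - axis[i])

def predict_ms(size, batch):
    """Bilinear interpolation of the runtime table at (size, batch)."""
    i, fx = _bracket(SIZES, size)
    j, fy = _bracket(BATCHES, batch)
    low = TIME_MS[i][j] * (1 - fy) + TIME_MS[i][j + 1] * fy
    high = TIME_MS[i + 1][j] * (1 - fy) + TIME_MS[i + 1][j + 1] * fy
    return low * (1 - fx) + high * fx

print(predict_ms(24, 550))            # halfway between grid points on both axes
```

One such table per operation type (multiplication, division) lets the factorization time of a candidate sparsity pattern be predicted by summing the per-level batched operations.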
38
Parallel Sparse Block LU Factorization
Representative sparsified MNA matrix (test matrix)
– Approximates the properties of block sparse matrix
– Created by averaging all sparsified MNA matrices
– Factorized to get the fill-ins’ locations
[Figure: sparsified system MNA matrices at t1, t2, …, ts are averaged into the test matrix; LU factorization of the test matrix gives the L factor, the U factor, and the fill-in locations (marked x)]
39
Parallel Sparse Block LU Factorization (cont.)
Data dependency graph
– Column k depends on column j when U(j, k) != 0 [1]
– Can be derived from U matrix
[Figure: 9x9 matrix pattern and the resulting data dependency graph, with columns arranged in Level 0 through Level 4]
[1] J. Gilbert and T. Peierls. Sparse partial pivoting in time proportional to arithmetic operations. SIAM J. Sci. Stat. Comput., 9(5):862–873, 1988.
40
Parallel Sparse Block LU Factorization (cont.)
Modified data dependency graph
– Identify “fake” dependency when L(j+1:n, j) == 0
– Eliminate “fake” dependencies
[Figure: 9x9 matrix pattern; original dependency graph with Level 0 through Level 4 and modified dependency graph with Level 0 through Level 2]
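The dependency rule and the removal of fake dependencies can be sketched as a levelization routine over the nonzero patterns of L and U. The 4x4 pattern below is a made-up example, not the 9x9 matrix from the figure, and the function name is hypothetical.

```python
def levelize(n, u_nonzeros, l_nonzeros=None):
    """Assign each column a level; columns sharing a level can be factored in parallel.
    u_nonzeros / l_nonzeros: sets of (row, col) positions strictly above / below the
    diagonal (0-based). Passing l_nonzeros enables fake-dependency elimination."""
    level = [0] * n
    for k in range(n):
        for j in range(k):
            if (j, k) not in u_nonzeros:
                continue                      # U(j, k) == 0: column k does not depend on j
            if l_nonzeros is not None and not any(r > j and c == j for r, c in l_nonzeros):
                continue                      # L(j+1:n, j) == 0: the dependency is fake
            level[k] = max(level[k], level[j] + 1)
    return level

# Hypothetical pattern: a chain of dependencies 0 -> 1 -> 2 -> 3 in U,
# where column 1 of L has no entries below the diagonal.
U = {(0, 1), (1, 2), (2, 3)}
L = {(1, 0), (3, 2)}

print(levelize(4, U))        # [0, 1, 2, 3]: four sequential levels
print(levelize(4, U, L))     # [0, 1, 0, 1]: two levels after removing the fake dependency
```

A column whose L part is empty contributes no update to later columns, so dropping those edges shortens the critical path and exposes more columns per level.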
41
Parallel Sparse Block LU Factorization (cont.)
GPU-based block sparse matrix LU factorizations
– Levelize the factorization according to data dependency graph
– Each level only contains matrix multiplication and division operations
– Use batched matrix multiplication and inversion functions provided by cuBLAS
[Figure: columns grouped into Level 0, Level 1, Level 2 of the dependency graph; for each level (Level 0 … Level n) the dense blocks are processed with batched division (÷) and batched multiplication (×) operations to produce the result]
42
Experiment Setup
Widely used RF circuits as the benchmark
CKT  Name           Nodes  Tones  Freqs  Nunk
1    mixer 1        302    2      25     14798
2    mixer 2        1988   2      41     161028
3    mixer 3        5262   2      5      47358
4    mixer 4        7532   2      13     188300
5    LNA + mixer 1  343    3      63     42875
6    LNA + mixer 2  5303   3      14     143181
7    LNA + mixer 3  7573   3      14     204471
Note:
• Freqs: number of harmonics
• Nunk: number of unknowns
43
Runtime and Memory Efficiency on CPU
Support-circuit preconditioned HB (SCPHB) method
– High robustness and efficiency
– Runtime speedup: 21X (compared with direct solver in DAC’09)
– Memory reduction: 6X (compared with direct solver in DAC’09)
     Direct solver      BD preconditioner  SCPHB preconditioner
CKT  Time(s)  Mem(GB)   Time(s)  K-Its     Time(s)  Mem(GB)  K-Its  Speedup
1    471.9    0.23      24.9     821       145.5    0.10     204    3.24X
2    19263.1  7.95      5637.6   6731      1408     1.72     383    13.7X
3    686.4    0.36      92.2     165       69.5     0.06     229    9.8X
4    14153.5  4.26      1072.3   273       1035.6   0.73     355    21.3X
5    2561.6   1.92      DNF      DNF       821.5    1        194    3.1X
6    4040.9   3.34      DNF      DNF       414.7    0.67     328    9.74X
7    6633.6   5.21      DNF      DNF       791      0.83     255    8.38X
K-Its: number of GMRES iterations; DNF: did not finish within 1000 Newton iterations
44
Runtime Efficiency for Strong Nonlinearities
Simulation runtime vs. input power of LNA+Mixer
– BD preconditioner: runtime increases exponentially
– SCPHB preconditioner: runtime remains nearly constant
45
Scalability
Nearly-linear runtime and memory scalability
[Figure: (a) runtime scalability, (b) memory scalability]