WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines
Wei Dong, Peng Li, Xiaoji Ye
Department of ECE, Texas A&M University
{weidong, pli, yexiaoji}@neo.tamu.edu
DAC 2008
Multi-Core Implications
The multi-core shift is changing the landscape of computing
New challenges & opportunities for EDA
– Free ride of single-threaded EDA applications on Moore’s Law is coming to an end
Question: How can we fully exploit increasingly parallel hardware and achieve good runtime scaling?
[Images: multicore processors, courtesy of Intel, AMD, and IBM]
Why Parallel Transient Simulation?
SPICE-like transient simulation is key to a wide range of ICs
– Memories, custom digital, analog/RF/mixed-signal
Long simulation time presents a significant bottleneck in design
– CPU time > days or weeks (e.g., transistor-level PLL simulation)
– Can lead to insufficient verification, non-optimal design, chip failure
Natural target for parallelization!
Prior Work
Fine-grained parallelization
– Parallel matrix solves, device model evaluations
– The efficiency of parallel matrix solvers deteriorates quickly
Parallel waveform relaxation [White et al. ’87, Reichelt et al. ICCAD ’03]
– Limited convergence properties
Domain decomposition [Wever et al., HICSS ’96]
– Can create dense problems
– Applicability is highly application dependent
[Figure: performance of a public parallel matrix solver on an 8-processor server: normalized runtime vs. number of cores/threads (1–8) for an SPD matrix and an unsymmetric matrix]
Our Strategies
Exploit coarse-grained & application-level parallelism
– Lessons learned before [T. Mattson, Intel]
– >100 parallel languages/environments were developed in the ’90s!
– Only a few that embedded significant domain knowledge proved successful
– Develop simulation algorithms parallelizable by construction
Goals/Benefits
– Reduce parallel overhead by applying domain knowledge
– Create rich parallelism for multi-/many-core platforms (pairing with fine-grained methods)
– Ease of parallel programming, debugging, and code reuse
– Do not jeopardize accuracy & convergence
Proposed Approach
Time-domain MNA formulation
Nonlinear DAEs: $f(x(t)) + \dfrac{d}{dt}\,q(x(t)) + u(t) = 0$
– $x(t)$: vector of unknowns
– $f(x(t))$: static nonlinearities
– $q(x(t))$: dynamic nonlinearities
– $u(t)$: inputs
How to parallelize along the time axis?
[Figure: one-step vs. two-step integration over time points t1–t5, showing the data dependency between successive points]
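To make the data dependency concrete, here is the standard discretized form (general numerical-integration background using the symbols above, not an equation taken from this deck): applying a two-step Gear formula at $t_{n+1}$ turns the DAE into a nonlinear algebraic system that cannot be solved before the two previous points are known.

```latex
% Gear2 discretization at t_{n+1}: with the derivative approximated by
%   d/dt q(x(t_{n+1})) ~ a0 q(x_{n+1}) + a1 q(x_n) + a2 q(x_{n-1}),
% each time point becomes a nonlinear algebraic system
\[
  F(x_{n+1}) = f(x_{n+1})
    + \alpha_0\, q(x_{n+1}) + \alpha_1\, q(x_n) + \alpha_2\, q(x_{n-1})
    + u(t_{n+1}) = 0 ,
\]
% solved by Newton iterations
\[
  \Bigl(\tfrac{\partial f}{\partial x} + \alpha_0 \tfrac{\partial q}{\partial x}\Bigr)\,\Delta x
    = -F\bigl(x_{n+1}^{(k)}\bigr), \qquad
  x_{n+1}^{(k+1)} = x_{n+1}^{(k)} + \Delta x .
\]
% x_{n+1} therefore depends serially on x_n and x_{n-1}, which is the
% dependency WavePipe works around.
```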
Waveform Pipelining (WavePipe)
[Figure: WavePipe overview. From the current/base position on the time axis T, backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) create concurrent solves; threads T1–T4 are scheduled at the granularity of waveform pipelining, paired with fine-grained parallel assists (parallel matrix solve / device evaluation) on a multi-/many-core machine]
Outline
Motivation
Overview
Parallel backward pipelining
Parallel forward pipelining
Experimental results
Summary
Parallel Backward Pipelining
Move backwards in time
Create additional independent computing tasks along the T axis
Why useful?
– Employed under variable-stepsize multi-step numerical integration
– A backward point contributes to a larger future time step
[Figure: backward pipelining (multi-step numerical integration) moves against the time axis from the current position; forward pipelining (predictive computing) moves ahead of it]
Variable-Stepsize Multi-Step Gear’s Method
Gear’s integration formula:
\[ \dot{x}_{n+1} = \sum_{k=0}^{p} \alpha_k\, x_{n+1-k} \]
– $p$: order of numerical integration
– $x_i$: circuit response at time point $i$
– $\alpha_k$: coefficients
Two-step Gear’s method [Shichman, Trans. Circuit Theory, 1970]:
\[ \dot{x}_{n+1} = \underbrace{\frac{2h_{n+1}+h_n}{h_{n+1}\,(h_{n+1}+h_n)}}_{\alpha_0}\, x_{n+1} \;-\; \underbrace{\frac{h_{n+1}+h_n}{h_{n+1}\,h_n}}_{\alpha_1}\, x_n \;+\; \underbrace{\frac{h_{n+1}}{h_n\,(h_{n+1}+h_n)}}_{\alpha_2}\, x_{n-1} \]
where $h_{n+1} = t_{n+1} - t_n$ and $h_n = t_n - t_{n-1}$ (see the coefficient sketch below).
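As a minimal illustration (not the paper's code; the function name gear2_coeffs is made up), the variable-step coefficients above translate directly into a few lines of C++:

```cpp
#include <array>
#include <cassert>

// Variable-stepsize two-step Gear (BDF2) coefficients {a0, a1, a2} such that
//   xdot_{n+1} ~ a0*x_{n+1} + a1*x_n + a2*x_{n-1},
// with h1 = t_{n+1} - t_n and h0 = t_n - t_{n-1}.
std::array<double, 3> gear2_coeffs(double h1, double h0) {
    assert(h1 > 0.0 && h0 > 0.0);
    const double a0 =  (2.0 * h1 + h0) / (h1 * (h1 + h0));
    const double a1 = -(h1 + h0) / (h1 * h0);
    const double a2 =  h1 / (h0 * (h1 + h0));
    return {a0, a1, a2};
}
// Sanity check: for equal steps h the formula reduces to the familiar
// fixed-step stencil (3x_{n+1} - 4x_n + x_{n-1}) / (2h), i.e.
// gear2_coeffs(h, h) == {1.5/h, -2.0/h, 0.5/h}.
```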
Local Truncation Error (LTE)
Numerical integration error incurred “locally” at each time point
– All the previous solutions are assumed to be accurate
LTEs in Gear’s methods
Two-step:
\[ \mathrm{LTE}_{n+1} = \frac{h_{n+1}^2\,(h_{n+1}+h_n)^2}{6\,(2h_{n+1}+h_n)}\, x^{(3)} = \frac{h_{n+1}^2\,(h_{n+1}+h_n)^2}{2h_{n+1}+h_n}\, DD_3 \]
Three-step (with $T_1 = h_{n+1}$, $T_2 = h_{n+1}+h_n$, $T_3 = h_{n+1}+h_n+h_{n-1}$):
\[ \mathrm{LTE}_{n+1} = \frac{T_1^2\,T_2^2\,T_3^2}{24\,(T_1T_2+T_1T_3+T_2T_3)}\, x^{(4)} = \frac{T_1^2\,T_2^2\,T_3^2}{T_1T_2+T_1T_3+T_2T_3}\, DD_4 \]
Divided differences approximate the derivatives, $\frac{d^k x}{dt^k} \approx k!\,DD_k$, and are built recursively (see the sketch below):
\[ DD_k(t_{n+1}) = \frac{DD_{k-1}(t_{n+1}) - DD_{k-1}(t_n)}{t_{n+1} - t_{n+1-k}} \]
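A small sketch of the divided-difference recursion above (illustrative only; a simulator would evaluate this per circuit unknown and reuse lower-order differences across steps):

```cpp
#include <vector>

// Divided difference DD_k over the last k+1 accepted time points;
// t and x hold times and (scalar) responses, newest last.
double divided_difference(const std::vector<double>& t,
                          const std::vector<double>& x, int k) {
    // dd[i] starts at DD_0 = x_i; after level j it holds the j-th order
    // difference ending at point i, matching the recursion on this slide.
    std::vector<double> dd(x.end() - (k + 1), x.end());
    std::vector<double> tt(t.end() - (k + 1), t.end());
    for (int j = 1; j <= k; ++j)
        for (int i = k; i >= j; --i)
            dd[i] = (dd[i] - dd[i - 1]) / (tt[i] - tt[i - j]);
    return dd[k];  // DD_k at the newest time point
}
```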
LTE-Based Time Step Control (Gear2)
Control the time step to meet an LTE tolerance
LTE’s dependency on $h_n$ & $h_{n+1}$: the LTE grows with both steps,
\[ \frac{\partial\,\mathrm{LTE}}{\partial h_{n+1}} = DD_3\,\frac{2h_{n+1}(h_{n+1}+h_n)\,\bigl(3h_{n+1}^2+3h_{n+1}h_n+h_n^2\bigr)}{(2h_{n+1}+h_n)^2} > 0, \qquad \frac{\partial\,\mathrm{LTE}}{\partial h_n} = DD_3\,\frac{h_{n+1}^2\,(h_{n+1}+h_n)\,(3h_{n+1}+h_n)}{(2h_{n+1}+h_n)^2} > 0 \]
Bounding step: writing $h_{n+1} = k\,h_n$, enforcing $\mathrm{LTE}_{n+1} \le E_{bound}$ gives
\[ \frac{k^2\,(k+1)^2}{2k+1}\, h_n^3\, DD_3 \le E_{bound}, \]
which determines the largest admissible step $h_{n+1}^{bound}$ (computed numerically in the sketch below).
Key observation
– A smaller $h_n$ permits a greater $h_{n+1}$, if $DD_3$ is nonincreasing
– Exploit for parallel computing
[Figure: time axis with the previous step $h_n$ and the next step $h_{n+1}$ = ? to be chosen]
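A minimal sketch (an assumption, not the paper's implementation; next_step_bound is a made-up name) of turning the bound above into a step choice. Because the left-hand side grows monotonically in k, a simple bisection on k suffices:

```cpp
#include <cmath>

// Largest next step h_{n+1} = k*h_n such that the Gear2 LTE estimate
//   k^2 (k+1)^2 / (2k+1) * h_n^3 * |DD3|   stays within e_bound.
double next_step_bound(double h_n, double dd3, double e_bound,
                       double k_max = 10.0) {
    auto lte = [&](double k) {
        return k * k * (k + 1.0) * (k + 1.0) / (2.0 * k + 1.0)
               * h_n * h_n * h_n * std::fabs(dd3);
    };
    double lo = 0.0, hi = k_max;
    if (lte(hi) <= e_bound) return hi * h_n;  // tolerance never binds
    for (int it = 0; it < 60; ++it) {         // bisect the monotone bound
        const double mid = 0.5 * (lo + hi);
        (lte(mid) <= e_bound ? lo : hi) = mid;
    }
    return lo * h_n;
}
```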
Parallel Backward Pipelining
Serial Gear2:
[Figure: serial time points t1 → t2 → t3 → t4 with steps h2, h3, h4]
Double-threaded Gear2 (sketched in code below):
– Initial conditions @ t1 & t2
– Tr1: t3 (h3 > h2); Tr2: steps back to t3’
– Tr1: t4 (h4 > h3’, enabled by the backward point); Tr2: steps back to t4’
Balance between efficiency and robustness: $h_i' = \gamma\,h_i$, $0 < \gamma < 1$
Extensible to multi-step methods (e.g., Gear3)
[Figure: Thread 1 advances through t3, t4; Thread 2 computes the backward points t3’, t4’ with reduced steps h3’, h4’]
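A compact sketch of one double-threaded backward-pipelining cycle, under stated assumptions rather than as the paper's code: the paper uses pThreads, while std::thread is used here for brevity, and Point, solve_point, estimate_step, and gamma are illustrative stand-ins.

```cpp
#include <thread>
#include <vector>

// Solution snapshot at one accepted time point.
struct Point { double t; std::vector<double> x; };

// Illustrative stubs standing in for the simulator (not the paper's API):
Point solve_point(const Point& last, const Point& prev, double h) {
    return Point{last.t + h, last.x};   // ...Gear2 + Newton loop goes here...
}
double estimate_step(const Point& last, const Point& prev) {
    return 2.0 * (last.t - prev.t);     // ...LTE-based bound goes here...
}

// One scheduling cycle: thread 1 takes the standard Gear2 step h; thread 2
// independently solves a backward point at the reduced step gamma*h. The
// next cycle then sees a small trailing step h_n, which permits a larger
// h_{n+1} (smaller h_n => greater h_{n+1} when DD3 is nonincreasing).
void backward_cycle(Point& prev, Point& last, double gamma /* 0<gamma<1 */) {
    const double h = estimate_step(last, prev);
    Point standard, backward;
    std::thread t2([&] { backward = solve_point(last, prev, gamma * h); });
    standard = solve_point(last, prev, h);  // thread 1 = the caller
    t2.join();                              // barrier: accept both points
    prev = backward;   // backward point becomes the new previous solution
    last = standard;   // standard point becomes the new base position
}
```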
Parallel Forward Pipelining
Move forwards in time
Exploit predictive computing along the forward T direction
Question
– How to resolve data dependency & ensure accuracy?
[Figure: forward pipelining (predictive computing) moves ahead of the current position along the time axis]
Parallel Forward Pipelining
Ex: double-threaded (sketched in code below)
– Init. conditions @ t1 & t2
– Time point t3 (step h3): FE estimate of sol@t3
– Time point t4 (step h4) placed using the estimate
– Solve sol@t3 & sol@t4 in parallel
– Time point t5 (step h5): FE estimate of sol@t5
– Time point t6 (step h6)
– Solve sol@t5 & sol@t6 in parallel
[Figure: Thread 1 and Thread 2 alternate over time points t3–t6 with steps h2–h6 along the time axis]
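The forward counterpart, under the same illustrative assumptions (reusing Point, solve_point, and estimate_step from the backward sketch; fe_predict is a stand-in, and beta is the damping factor β < 1.0 discussed on the next slide):

```cpp
#include <thread>

// Forward-Euler-style predictor: extrapolate from the converged point at
// last.t (a linear stand-in for x + h*xdot).
Point fe_predict(const Point& last, const Point& prev, double h) {
    Point p{last.t + h, last.x};
    const double slope_dt = last.t - prev.t;
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += h * (last.x[i] - prev.x[i]) / slope_dt;
    return p;
}

// One double-threaded forward cycle: a cheap FE estimate at t3 lets thread 2
// place t4 and Newton-solve it concurrently with the exact solve of t3.
void forward_cycle(Point& prev, Point& last, double beta /* < 1.0 */) {
    const double h3 = beta * estimate_step(last, prev);  // damped estimates:
    const Point est3 = fe_predict(last, prev, h3);       // initial conditions
    const double h4 = beta * estimate_step(est3, last);  // are inexact
    Point p3, p4;
    std::thread t2([&] { p4 = solve_point(est3, last, h4); });  // predictive
    p3 = solve_point(last, prev, h3);   // thread 1 = the caller
    t2.join();
    prev = p3;   // both points become initial conditions for the next cycle
    last = p4;   // (subject to the accuracy checks on the following slides)
}
```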
Complications
Time steps for forward points may not be estimated accurately
– Data dependency on initial conditions
– Apply a damping factor (β < 1.0) for time step estimation
– Revoke forward results within the thread scheduling cycle (covered later)
Forward points may be based on inaccurate initial conditions
– Addressed by inter-thread communication
– Tradeoffs provided by fine-/coarse-grained communication
[Figures: a forward point ahead of the base position raises two questions: which step h = ? to take, and whether its accuracy can be trusted]
Coarse-Grained Inter-thread Communication
[Figure: Threads 1–3 each run FE estimation followed by a Newton loop (one or more iterations) to convergence at time points 1–3; a forward thread re-solves once the preceding point has converged]
Iterate on the converged initial condition
Fine-Grained Inter-thread Communication
Communicate at the granularity of NR iterations (one possible realization is sketched below)
– Beneficial to large circuits
[Figure: Threads 1–3 at time points 1–3; each runs FE estimation and NR iterations 1–3 to convergence, exchanging intermediate iterates after every NR iteration rather than only at convergence]
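One plausible realization of iteration-granularity communication (an assumption, not the paper's mechanism): each thread publishes its latest Newton iterate through an atomically exchanged pointer, so a downstream thread can refresh its initial condition before every NR iteration.

```cpp
#include <atomic>
#include <memory>
#include <vector>

// Latest-iterate mailbox between adjacent pipeline threads (C++20
// std::atomic<std::shared_ptr>). The upstream thread publishes a snapshot
// after every NR iteration; the downstream thread grabs the newest one,
// if any, before its own next iteration.
struct IterateMailbox {
    std::atomic<std::shared_ptr<const std::vector<double>>> latest{nullptr};

    void publish(std::vector<double> x) {   // upstream: after each iteration
        latest.store(std::make_shared<const std::vector<double>>(std::move(x)),
                     std::memory_order_release);
    }
    std::shared_ptr<const std::vector<double>> take() {  // downstream
        return latest.exchange(nullptr, std::memory_order_acquire);
    }
};
```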
Multi-threaded WavePipe
Combine backward with forward waveform pipelining
Ex: 4T (1-backward-2-forward) WavePipe
[Figure: one thread scheduling cycle. Starting from the initial solutions, T2 takes the backward point, T1 the standard (base) Gear2 point, T3 the forward point, and T4 the 2nd forward point; each thread performs its own time-step estimation, FE prediction, and Newton solve]
Thread Scheduling
The work done over an overestimated step is discarded
[Figure: standard 4-thread WavePipe (1-backward-2-forward scheme). Without step-size overestimation, each scheduling cycle starts from the initial conditions and completes fully; with overestimation, a cycle may only partially complete, and the revoked forward work is redone in the next cycle]
Experimental Setup
An 8-processor Linux server with four dual-core processors
WavePipe implemented in C/C++ using pThreads (Gear2)
Compare with
– Reference serial SPICE-like (Gear2) transient simulation
– Low-level parallel matrix solve (SuperLU) and device evaluation
Test circuits
Index  Circuit            Size    Time Points  Serial Run Time (s)
1      VCO                20      86,023       37.59
2      Power Amplifier    8       113,972      30.12
3      DB mixer           27      134,612      48.11
4      Ring Oscillator    61      110,037      206.37
5      Frequency Divider  17      44,795       18.49
6      Digital Adder      112     2,558        8.93
7      RLC mesh 1         13,097  664          2,704.08
8      RLC mesh 2         27,670  143          2,659.35
Experimental Results – Accuracy & Profiling
[Figure: output waveforms, 3T (1-backward + 1-forward) WavePipe vs. serial (DB mixer)]
[Figure: real-time threading profile (mesh circuit)]
Experimental Results – 2T Speedups
2T 1-backward & 2T 1-forward

Circuit            2T 1-backward     2T 1-forward
                   T(s)    Speedup   T(s)    Speedup
VCO                27.3    1.38      23.1    1.63
Power Amplifier    21.7    1.39      18.1    1.66
DB mixer           36.9    1.30      30.8    1.56
Ring Oscillator    149.3   1.38      121.9   1.69
Frequency Divider  15.3    1.21      12.6    1.47
Digital Adder      7.3     1.22      6.0     1.49
RLC mesh 1         2245.1  1.20      1814.6  1.49
RLC mesh 2         2159.3  1.23      1742.2  1.53
Average                    1.29X             1.57X
Experimental Results – 3T Speedups
3T 1-backward-1-forward & 3T 2-forward

Circuit            3T 1-back-1-forward   3T 2-forward
                   T(s)    Speedup       T(s)    Speedup
VCO                20.3    1.85          19.6    1.92
Power Amplifier    16.2    1.86          15.4    1.96
DB mixer           27.6    1.74          26.3    1.83
Ring Oscillator    112.4   1.84          107.2   1.93
Frequency Divider  11.2    1.65          10.7    1.73
Digital Adder      5.4     1.65          5.1     1.75
RLC mesh 1         1679.6  1.61          1559.0  1.73
RLC mesh 2         1589.3  1.67          1487.4  1.79
Average                    1.73X                 1.83X
Experimental Results – 4T Speedups
4T 1-backward-2-forward & 4T 3-forward

Circuit            4T 1-back-2-forward   4T 3-forward
                   T(s)    Speedup       T(s)    Speedup
VCO                16.8    2.24          16.1    2.33
Power Amplifier    13.8    2.18          13.2    2.28
DB mixer           22.7    2.12          21.6    2.23
Ring Oscillator    94.7    2.18          91.0    2.27
Frequency Divider  9.2     2.01          8.7     2.16
Digital Adder      4.5     2.03          4.2     2.13
RLC mesh 1         1390.2  1.95          1324.6  2.04
RLC mesh 2         1330.8  2.00          1265.4  2.10
Average                    2.09X                 2.19X
Experimental Results – Runtime Scaling
[Figure: runtime scaling for 2–4 threads]
Experimental Results
Low-level scheme
– Parallel matrix solve & device model evaluation
Proposed scheme
– 1–4 threads: WavePipe
– 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation
Summary
Multi-core challenges & opportunities for EDA
Application-level coarse-grained parallelism for transient simulation
– Parallelize at the granularity of single time-point circuit solutions
– Inherently low inter-core communication overhead
– Maintain accuracy & convergence
– Ease of implementation and code reuse
Rich sets of parallelism for multi-core and many-core systems
– New parallel opportunities orthogonal to fine-grained schemes
– Pair with parallel matrix solve, device evaluation and low-level parallel programming assists
Thanks