Accelerating the SPICE Circuit Simulator Using an FPGA: A Case Study

Nachiket Kapre and André DeHon

Abstract Spatial processing of sparse, irregular, double-precision floating-point computation using a single FPGA enables up to an order of magnitude speedup and energy savings over a conventional microprocessor for the simulation program with integrated circuit emphasis (SPICE) circuit simulator. We develop a parallel, FPGA-based, heterogeneous architecture customized for accelerating the SPICE simulator to deliver this speedup. To properly parallelize the complete simulator, we decompose SPICE into its three constituent phases—Model Evaluation, Sparse Matrix-Solve, and Iteration Control—and customize a spatial architecture for each phase independently. Our heterogeneous FPGA organization mixes very large instruction word (VLIW), Dataflow and Streaming architectures into a cohesive, unified design. We program this parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator using streaming (SCORE framework), data-parallel (Verilog-AMS models) and dataflow (KLU matrix solver) patterns. Our FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and streaming, overlapped processing of the control algorithms. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling continues to slow down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption.

N. Kapre (✉)
Nanyang Technological University, 50 Nanyang Avenue, Singapore
e-mail: [email protected]

A. DeHon
University of Pennsylvania, Philadelphia, PA 19104, USA
e-mail: [email protected]

W. Vanderbauwhede and K. Benkrid (eds.), High-Performance Computing Using FPGAs, DOI 10.1007/978-1-4614-1791-0_13, © Springer Science+Business Media, LLC 2013



1 Introduction

SPICE (Simulation Program with Integrated Circuit Emphasis) is an analog circuit simulator used extensively to simulate and verify the operation of silicon circuits. It models the analog behavior of semiconductor circuits using a compute-intensive, nonlinear, differential equation solver. This can take days or weeks of runtime on real-world circuits. SPICE is notoriously difficult to parallelize due to its irregular compute structure and a sloppy sequential description [34]. It has been observed that less than 7% of the floating-point operations in SPICE are automatically vectorizable [15].

Spatial parallelism provides a suitable model for constructing accelerators for challenging problems like SPICE. It offers a natural way to express the heterogeneous computational structure in SPICE and exposes the inherent parallelism available in the problem. Furthermore, modern FPGAs can be configured to efficiently support spatial parallelism with multiple floating-point operators coupled to hundreds of distributed, on-chip memories and interconnected by a flexible routing network. In Table 2, we observe that modern FPGAs can match and even surpass the peak floating-point capacity of modern multi-core processors while dissipating far less power. Spatial parallelism allows us to configure the FPGA to deliver a higher fraction of this floating-point peak through a combination of careful static scheduling and low-overhead distributed processing.

As shown in Table 1, a SPICE simulation accepts a netlist description of the circuit to be simulated along with the input stimulus. The simulator then returns the response of the circuit in the form of output analog waveforms as shown in Fig. 1. The simulation algorithm discretizes the circuit response and repeatedly solves the circuit equations at each discrete step to generate the output waveforms. We also show an abstract internal representation of the simulation algorithm in Fig. 2. Each iteration of this iterative simulation consists of two computationally intensive phases: Model Evaluation (② in Fig. 2) followed by Matrix Solve (③ in Fig. 2). This organization allows the nonlinear, differential equation solver to be simplified to a system of linear equations Ax = b which is handled in the Matrix Solve phase. The nonlinear, time-varying circuit elements are linearized using a Newton–Raphson loop and discretized using trapezoidal integration in the Model-Evaluation phase. These two loops are managed in the third phase of SPICE, the Iteration Controller (① in Fig. 2). A well-balanced, scalable, parallel architecture must accelerate all three phases of SPICE.
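The iteration structure described above (Model Evaluation linearizes each device, Matrix Solve handles the resulting linear system, and the Iteration Controller checks convergence) can be sketched for a single nonlinear device. This is a minimal illustration with invented values (a 1 V source driving a diode through a 1 kΩ resistor; IS and VT are assumed constants), not code from the chapter:

```python
# Hypothetical sketch of the SPICE loop from Fig. 2: per step, Newton-Raphson
# iterations alternate Model-Evaluation (linearize the device) and Matrix-Solve
# (here scalar: solve one KCL equation) until the Iteration Control converges.
import math

IS, VT = 1e-15, 0.025   # diode saturation current, thermal voltage (assumed)
VSRC, R = 1.0, 1e3      # 1 V source through a 1k resistor into the diode

def model_evaluation(v):
    """Linearize the diode at v: conductance g_d and companion current i_eq."""
    i_d = IS * (math.exp(v / VT) - 1.0)
    g_d = IS * math.exp(v / VT) / VT
    return g_d, i_d - g_d * v          # companion model: I = g_d*V + i_eq

def solve_operating_point(v0=0.5, tol=1e-12, max_iters=100):
    v = v0
    for it in range(max_iters):
        g_d, i_eq = model_evaluation(v)
        # Matrix-Solve (1x1): KCL at the node, (1/R + g_d)*v = VSRC/R - i_eq
        v_new = (VSRC / R - i_eq) / (1.0 / R + g_d)
        if abs(v_new - v) < tol:       # Iteration Control: convergence check
            return v_new, it + 1
        v = v_new
    raise RuntimeError("Newton-Raphson did not converge")

v_op, iters = solve_operating_point()
```

For this one-node circuit the "matrix" is a scalar, but the Model-Evaluation / Matrix-Solve / Iteration-Control split mirrors the phases in Fig. 2.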

Table 1 Example SPICE netlist

* R-C-D Circuit Topology
V1 1 0 PWL(0 1 1e-11 2 2e-11 3)
R1 1 2 1
D1 2 0 DNOM
C1 2 0 10e-11
.MODEL DNOM D (IS=1E-15 Vj=0.02 cjo=1e-9)

* SPICE Analysis options
.TRAN 1e-12 3e-9
.PLOT TRAN v(2)
.END


Table 2 Raw floating-point throughput and power (double-precision)

Chip                    Tech. (nm)   Clock (GHz)   Peak GFLOPS (double)   Power (W)
Intel Core i7 965       45           3.2           25                     130
Xilinx Virtex-6 LX760   40           0.2           26                     20–30

Fig. 1 Example of a SPICE simulation. (a) Input circuit; (b) output waveform V(2), voltage (V) vs. time (s) over 0–3e-9 s

Fig. 2 Flowchart of a SPICE simulator

This chapter reviews our previous research [19–22] that systematically solves different subproblems emerging from the complete SPICE acceleration challenge.


We now list the key themes of this chapter:

• We accelerate the Model-Evaluation phase of SPICE using a custom VLIW organization overlaid on top of the FPGA [19]. We show the scalability of our approach across different Verilog-AMS models [21]. This idea is broadly applicable to other irregular, data-parallel problems that underutilize multi-core CPUs or GPUs.

• We implement the Sparse Matrix-Solve phase of SPICE on an FPGA [20] using a Token Dataflow architecture customized for Matrix-Solve. Token Dataflow designs are suitable for large-scale parallel computations that are either challenging for static scheduling or structurally resolved at runtime.

• We design a hybrid, VLIW architecture for implementing the apparently sequential fraction of the SPICE simulator, the Iteration Controller phase, on the FPGA [22]. We believe this solution is particularly important for avoiding an Amdahl's Law bottleneck once we have accelerated the compute-intensive portion of the application but still desire additional speedup and scalability.

• We integrate the different phases of SPICE together and outline a programming methodology and execution flow to use the accelerator for different simulations. Our approach highlights the benefits of avoiding an expensive per-circuit-instance compilation flow while still delivering the benefits of spatial parallelism.

The rest of this chapter is organized as follows. We explain the underlying computational characteristics of SPICE and briefly explain FPGA-based implementation of computation in Sect. 2. Next, we discuss suitable FPGA compute organizations for implementing the three SPICE phases in Sects. 3, 4 and 5, respectively. In Sect. 6, we provide details about the composed FPGA compilation framework and quantify the performance and energy of the complete SPICE accelerator. Finally, we wrap up with some key insights and lessons in Sect. 8.

2 Background

2.1 Summary of SPICE Algorithms

SPICE simulates the dynamic analog behavior of a circuit described by nonlinear differential equations. SPICE solves the nonlinear differential circuit equations by computing small-signal linear operating-point approximations for the nonlinear and time-varying elements until termination (① in Fig. 2). We show an example R-C-D circuit topology and a corresponding transient simulation in Table 1. The linearized system of equations is represented as a solution of Ax = b handled in the Matrix-Solve phase (③ in Fig. 2), where A is the matrix of circuit conductances, b is the vector of known currents and voltage quantities, and x is the vector of unknown voltages and branch currents. The simulator calculates entries in A and b from the device model equations that describe device transconductance (e.g. Ohm's law for resistors, transistor I–V characteristics) in the Model-Evaluation phase (② in Fig. 2).
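The assembly of A and b can be sketched with conductance "stamps". This is a minimal, dependency-free illustration; the two-node resistor network and all values are invented for the example:

```python
# Sketch of Modified Nodal Analysis assembly: each element "stamps" its
# conductance into A, and each source stamps into b; -1 denotes ground.

def stamp_resistor(A, n1, n2, r):
    """Stamp conductance 1/r between nodes n1 and n2."""
    g = 1.0 / r
    for n in (n1, n2):
        if n >= 0:
            A[n][n] += g
    if n1 >= 0 and n2 >= 0:
        A[n1][n2] -= g
        A[n2][n1] -= g

def stamp_current_source(b, n1, n2, i):
    """Current i flowing from node n1 into node n2."""
    if n1 >= 0: b[n1] -= i
    if n2 >= 0: b[n2] += i

A = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
stamp_current_source(b, -1, 0, 1e-3)   # 1 mA injected into node 0
stamp_resistor(A, 0, 1, 1e3)           # 1 kOhm between nodes 0 and 1
stamp_resistor(A, 1, -1, 1e3)          # 1 kOhm from node 1 to ground

# Solve the 2x2 system Ax = b by Cramer's rule (stand-in for Matrix-Solve)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x = [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]
```

With 1 mA flowing through two series 1 kΩ resistors, the solve recovers node voltages of 2 V and 1 V.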

Fig. 3 Scaling trends for FLOPS and runtime (spice3f5). (a) Sequential runtime scaling of SPICE simulator: runtime/iteration (s) vs. circuit size, with the sequential CPU following ~N^1.2 and the parallel FPGA ~N^0.7. (b) Peak FLOPS scaling of Intel CPUs (80486 through Core i7): FLOPS vs. transistors, growing as ~N^0.96

2.2 SPICE Performance Analysis

Since the SPICE simulation is an iterative algorithm, we can understand key characteristics of the complete simulation by analyzing a single iteration. In Fig. 3a, we show performance scaling trends for a single iteration of the SPICE solver for two scenarios. First, we show data for a sequential implementation of the open-source spice3f5 package on an Intel Core i7 965 across a range of benchmark circuits shown later in the Appendix. We also show data for our parallel FPGA implementation across the same benchmarks. We observe that sequential runtime for one iteration scales as O(N^1.2) as we increase circuit size N, while parallel runtime scales more slowly, as O(N^0.7). These trends have been previously reported in [34]. Our experiments re-examine this claim on modern circuits and modern architectures and observe that it continues to hold true. In Fig. 3b, we show the peak floating-point scaling trends of Intel CPUs, obtained from Intel datasheets, to contrast against SPICE runtime trends. We observe that the sequential CPU peak (FLOPS) has barely scaled as O(N) while SPICE runtime has scaled super-linearly as O(N^1.2). While Moore's Law continues to deliver increasing circuit sizes (for both circuit simulation and CPU processing), the CPU floating-point peak has been unable to keep up with the super-linear scaling rate of simulation times. This means there is a widening performance gap between CPU peak and SPICE runtime. In contrast, the FPGA processing capabilities shown in Table 2 can be organized entirely in parallel, thereby allowing performance to scale as the critical latency of the computation, O(N^0.7), as shown in Fig. 3a.

Fig. 4 Sequential runtime distribution of SPICE simulator (percentage of total runtime vs. circuit size; averages: Model-Evaluation 55%, Matrix-Solve 38%, Iteration Control 7%)
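Scaling exponents like the O(N^1.2) quoted above are obtained by fitting a line to runtime versus circuit size on log–log axes. A sketch of the method with synthetic data (the measurements are invented; only the fitting procedure is illustrated):

```python
# Estimate a scaling exponent alpha in t ~ N^alpha by least-squares fitting
# the slope of log(t) against log(N).
import math

def fit_exponent(sizes, runtimes):
    """Least-squares slope of log(t) vs. log(N)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

sizes = [10 ** k for k in range(2, 7)]
runtimes = [1e-6 * n ** 1.2 for n in sizes]   # synthetic O(N^1.2) data
alpha = fit_exponent(sizes, runtimes)
```

On the synthetic data the fit recovers the exponent 1.2 exactly; on real measurements it returns the best-fit slope.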

To further understand SPICE performance trends, we break down the contribution of the different phases to total SPICE runtime in Fig. 4. We observe that the Model-Evaluation and Sparse Matrix-Solve phases account for over 90% of total SPICE runtime across the entire benchmark set. For circuits dominated by nonlinear devices, the Model-Evaluation phase accounts for as much as 90% (55% average) of total runtime, since the runtime of this phase scales linearly with the number of nonlinear devices in the circuit. Simulations of circuits with a large number of resistors and capacitors (i.e. linear elements) generate large matrices, and consequently the Sparse Matrix-Solve phase accounts for as much as 70% of runtime (38% average). This phase empirically scales as O(N^1.2), which explains the super-linear scaling of overall SPICE runtime. Finally, the Iteration Controller phase of SPICE comprises a small but nontrivial fraction (≈7%) of total runtime. While this represents a small fraction of total runtime, once we accelerate the Model-Evaluation and Sparse Matrix-Solve phases, it can become an Amdahl's Law bottleneck limiting overall application speedup. Thus, our parallel FPGA architecture must parallelize all three phases of SPICE.
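The Amdahl's Law concern can be made concrete with the average phase fractions from Fig. 4 (55%/38%/7%). A sketch (the per-phase speedups below are placeholders, not measured values):

```python
# Amdahl's Law over the three SPICE phases: even "infinite" speedup on
# Model-Evaluation and Matrix-Solve caps overall speedup at 1/0.07 ~ 14x
# if the Iteration Controller is left sequential.

def overall_speedup(fractions, speedups):
    """fractions: share of sequential runtime; speedups: per-phase speedup."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))

# Accelerate the first two phases "infinitely", leave Iteration Control alone
cap = overall_speedup([0.55, 0.38, 0.07], [1e12, 1e12, 1.0])
```

This is why the chapter insists that all three phases, including the small Iteration Control fraction, must be parallelized.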

2.3 SPICE Model-Evaluation

In the Model-Evaluation phase, the simulator computes conductances and currents through different elements of the circuit and updates corresponding entries in the matrix with those values. For resistors, this needs to be done only once at the start of the simulation. For nonlinear elements, the simulator must search for an operating point using Newton–Raphson iterations, which requires repeated evaluation of the model equations and a linear solve multiple times per time-step, as shown by the innermost loop in step ① of Fig. 2. For time-varying components, the simulator must recalculate their contributions at each timestep based on voltages at several previous timesteps in the outer loop in step ① of Fig. 2.

Fig. 5 Work-vs-latency of model-evaluation phase (operations grow as ~N^1 while latency stays constant at ~N^0 as the number of nonlinear circuit elements increases)

In Fig. 5, we plot the number of floating-point operations and the latency of evaluation (floating-point operations along the critical path from input to output) as a function of the number of nonlinear elements in the circuit. Since each device contributes a fixed number of floating-point operations per instance, we see a linear growth in the number of operations. However, the latency of evaluation stays constant since each evaluation is completely independent and can be processed simultaneously. This highly data-parallel computation is suitable for implementation on FPGAs, GPUs, as well as multi-cores.
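The work-versus-latency distinction plotted in Fig. 5 can be sketched on a small dataflow graph; the diode-like DAG below is a hypothetical stand-in for a real device-model graph:

```python
# "Work" is the total operation count; "latency" is the longest input-to-output
# path through the feed-forward dataflow graph (unit delay per operation).
from functools import lru_cache

# Hypothetical diode-like DAG: node -> list of predecessor nodes
dag = {
    "div":  ["V", "vj"],
    "exp":  ["div"],
    "sub1": ["exp"],
    "mul":  ["is", "sub1"],
}
inputs = {"V", "vj", "is"}

def work(dag):
    return len(dag)                      # one floating-point op per node

def latency(dag, inputs):
    @lru_cache(maxsize=None)
    def depth(node):
        if node in inputs:
            return 0
        return 1 + max(depth(p) for p in dag[node])
    return max(depth(n) for n in dag)

w, l = work(dag), latency(dag, inputs)
```

For one device the two numbers coincide, but with thousands of independent devices the work grows linearly while the latency stays fixed, which is exactly the gap Fig. 5 shows.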

2.4 SPICE Matrix Solve (Ax = b)

Modern SPICE simulators use modified nodal analysis (MNA) [5] to assemble circuit equations into the matrix A. This generates highly sparse, asymmetric matrices which are processed using sparse, direct LU factorization techniques to deliver robust simulation results. Our approach uses the state-of-the-art KLU matrix solver [35] optimized for SPICE circuit simulation and avoids per-iteration changes to the matrix structures. The static nonzero pattern enables reuse of the matrix factorization graph across all SPICE iterations and allows us to perform a one-time distribution of computation across a parallel architecture. The solver reorders the matrix A to minimize fill-in using block triangular factorization (BTF) and column approximate minimum degree (COLAMD) techniques. It then uses the left-looking Gilbert–Peierls [13] algorithm to compute the LU factors of the matrix column-by-column such that A = LU. Finally, it calculates the unknown x using Front-Solve Ly = b and Back-Solve Ux = y operations.
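The factor-once, solve-repeatedly structure can be sketched as follows. This is a dense, unpivoted illustration of the LU / Front-Solve / Back-Solve pipeline, not KLU's actual sparse left-looking implementation (no BTF or COLAMD reordering):

```python
# Factor A = LU, then Front-Solve Ly = b and Back-Solve Ux = y.
# Dense Doolittle factorization without pivoting -- illustration only.

def lu_factor(A):
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def front_solve(L, b):                # Ly = b (forward substitution)
    y = b[:]
    for i in range(len(b)):
        for j in range(i):
            y[i] -= L[i][j] * y[j]
    return y

def back_solve(U, y):                 # Ux = y (backward substitution)
    n = len(y)
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= U[i][j] * x[j]
        x[i] /= U[i][i]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
L, U = lu_factor(A)
x = back_solve(U, front_solve(L, b))
```

Because the nonzero pattern of A is fixed across SPICE iterations, a real solver reuses the factorization structure and repeats only the numeric factor/solve steps each iteration.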

Fig. 6 Work-vs-latency of sparse matrix-solve phase (operations grow as ~N^1.4 while critical-path latency grows as ~N^0.7 with circuit size)

In Fig. 6, we plot the number of floating-point operations in the factorization and the latency of evaluation as a function of the size of the circuit. We observe that the number of floating-point operations in the Matrix-Solve computation scales as O(N^1.4) while the latency of the critical path through the compute graph scales as O(N^0.7). This suggests a parallel potential of O(N^0.7) which can be realized by distributing the dataflow graph across ideal parallel hardware (e.g. no communication delays, perfect distribution, unlimited internal processing bandwidth).

2.5 SPICE Iteration Controller

The SPICE iteration controller shown in Fig. 2 is responsible for two kinds of iterative loops: (1) an inner loop of Newton–Raphson linearization iterations for nonlinear devices and (2) an outer loop of adaptive time-stepping for time-varying devices. The Newton–Raphson algorithm is responsible for computing the linear operating point for the nonlinear devices like diodes and transistors. Additionally, an adaptive time-stepping algorithm based on truncation-error calculation (trapezoidal approximation, Gear approximation) is used for handling the time-varying devices like capacitors and inductors. The controller implements customized convergence conditions and local truncation-error estimations that determine how the transient analysis state machines are advanced at runtime in a data-dependent manner. The state-machine and breakpoint-processing logic are highly data-dependent and determine the total number of SPICE iterations required for the complete simulation.
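The adaptive time-stepping decision described above can be sketched as an accept/reject rule driven by the local truncation error (LTE). The shrink/grow constants and the dt³ error model for the trapezoidal rule are textbook-style assumptions, not SPICE's exact heuristics:

```python
# Accept or reject a timestep based on the LTE estimate, and propose the
# next dt; the trapezoidal rule's LTE scales roughly as dt**3.

def next_timestep(dt, lte, tol, shrink=0.5, grow=1.5, safety=0.9):
    """Return (accepted, next_dt) from the local truncation error estimate."""
    if lte > tol:
        return False, dt * shrink          # reject: redo step with smaller dt
    # scale dt toward the tolerance, capped by the growth limit
    factor = min(grow, safety * (tol / max(lte, 1e-300)) ** (1.0 / 3.0))
    return True, dt * factor

accepted, dt1 = next_timestep(dt=1e-9, lte=1e-6, tol=1e-7)    # too much error
accepted2, dt2 = next_timestep(dt=dt1, lte=1e-8, tol=1e-7)    # within budget
```

The first call rejects and halves the step; the second accepts and grows it, which is the data-dependent behavior that makes this phase hard to schedule statically.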

As we saw earlier in Fig. 4, the Iteration Control phase only accounts for ≈7% of total sequential runtime. However, our parallel SPICE implementation takes care to implement this portion efficiently to avoid an Amdahl's Law bottleneck. We show the danger of ignoring this phase for parallelization in Fig. 7, which shows the runtime breakdown for the r4k netlist in different implementation scenarios. We observe that we can get a speedup of ≈6× when parallelizing the Model-Evaluation and Sparse Matrix-Solve phases of SPICE (parallel FPGA runtimes obtained from Sect. 7). If we also parallelize the Iteration Control phase, we can improve the overall speedup to ≈9×. The Iteration Control phase of SPICE is dominated by data-parallel operations in convergence detection and truncation-error estimation which can be described effectively in a streaming fashion. The loop-management logic for the Newton–Raphson and timestepping iterations is control-intensive and highly irregular. We can capture both these computational structures effectively using a streaming framework.

Fig. 7 Parallel potential for iteration control (r4k netlist): runtime/iteration in milliseconds, split into Iteration Control, Model-Evaluation and Matrix-Solve, for the Fully Sequential, CPU IterCtrl, Microblaze IterCtrl and VLIW IterCtrl scenarios

2.6 Promise of FPGAs

We briefly review the FPGA architecture and highlight some key characteristics of an FPGA that make it well suited to accelerate SPICE. A Field-Programmable Gate Array (FPGA) is a massively parallel architecture that implements computation using hundreds of thousands of tiny programmable computing elements called k-LUTs (k-input lookup tables that can implement any boolean function of k inputs, typically k = 4–6) connected to each other using a programmable bit-level communication fabric. An FPGA allows us to configure the computation in space rather than time and evaluate multiple operations concurrently in a fine-grained fashion. In Fig. 8, we show a simple calculation and its conceptual implementation on a CPU and an FPGA. For a CPU implementation, we process the instructions stored in an instruction memory temporally on an ALU while storing the intermediate results (i.e. variables) in a data memory. Thus, a single evaluation of the graph takes several CPU cycles. On an FPGA, we can implement the operations as pipelined spatial circuits while implementing the dependencies between the operations physically, using pipelined wires instead of variables stored in memory, to get high performance. This allows the FPGA mapping to start a new evaluation in each cycle, delivering higher throughput than the CPU. Modern FPGAs also include hundreds of embedded memories distributed across the fabric that deliver 10–100× higher on-chip bandwidth compared to a processor [10]. The spatial FPGA fabric can be configured to implement hundreds of specialized datapaths connected to high-bandwidth on-chip memories, thereby providing a higher overall throughput compared to modern multi-core processors. This potential provides a foundation for customizing a specialized SPICE accelerator using the FPGA fabric.

Fig. 8 Implementing computation
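The temporal-versus-spatial contrast above can be captured in a toy cycle-count model; the operation count, pipeline depth and evaluation count below are invented, illustrative numbers:

```python
# A sequential processor spends ~one cycle per operation per evaluation,
# while a fully pipelined spatial design accepts a new evaluation every
# cycle once the pipeline has filled.

def cpu_cycles(num_ops, num_evals):
    return num_ops * num_evals                 # temporal: ops share one ALU

def fpga_cycles(pipeline_depth, num_evals):
    return pipeline_depth + num_evals - 1      # spatial: one result per cycle

ops, depth, evals = 100, 40, 10_000
speedup = cpu_cycles(ops, evals) / fpga_cycles(depth, evals)
```

With many evaluations in flight, the pipeline fill cost is amortized and the throughput advantage approaches the operation count per evaluation.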

2.7 Historical Review

We now review the various studies and research projects over the past three-and-a-half decades that have attempted to build parallel SPICE systems. Some of these studies accelerate SPICE by devoting expensive hardware resources to squeeze out additional performance, while others reorganize the computation to use lower-precision evaluation that is easier to parallelize. Our approach expands on certain ideas from the past while delivering a cheaper, SPICE-accurate accelerator.

We can refine the classification of parallel SPICE approaches by considering the underlying trends and characteristics of the different systems as follows:

1. Compute Organization: We see parallel SPICE solvers using a range of different compute organizations including conventional multi-processing, multi-core, VLIW, SIMD, and Vector.

2. Precision: Under certain conditions, SPICE simulations can efficiently model circuits at lower precisions.

3. Compiled Code: In many cases, it is possible to generate efficient instance-specific simulations by specializing the simulator for a particular circuit.

4. Numerical Algorithms: Different classes of circuits perform better with a suitable choice of matrix factorization algorithm. Our FPGA design may benefit from new ideas for factoring the circuit matrix.


One of the early parallel SPICE designs, Awsim-3 [27, 28], uses a compiled-code approach and a special-purpose system with lower-precision, table-lookup Model-Evaluation (Compiled Code, Precision) to provide a speedup of 560× over a Sun 3/60. However, the bulk of these speedups is due to a dedicated hardware floating-point unit, since the Sun 3/60 implements floating-point in software (tens of cycles/operation). Additionally, table-lookup approximations result in a simulation with accuracy trade-offs. A message-passing, parallel SPICE implementation [16] on an expensive, 40-node SGI Origin 2000 supercomputer (MIPS R10K processors) was able to speed up SPICE for certain specialized benchmarks by 24× (Compute Organization). More recently, in [25], a multi-threaded version of SPICE was developed using PThreads. It achieves a speedup of 5× using 8 SMPs (Symmetric Multi-Processors) on a small benchmark set which is amenable to parallel matrix factorization (Compute Organization). GPUs have been used to speed up the data-parallel Model-Evaluation phase of SPICE by 50× [1] (double-precision on an ATI GPU) or 32× [14] (lower-accuracy, single-precision on an NVIDIA GPU), but can only accelerate the SPICE simulator in tandem with the CPU by 3× for the 2-chip GPU-CPU processing system (Compute Organization). Recent approaches [37] have used coarse-grained domain-decomposition techniques to parallelize SPICE by 31×–870× (mean 119×) across a 32-processor grid at SPICE-level accuracy (Numerical Algorithms).

FPGAs have traditionally enjoyed limited use for accelerating SPICE due to the limited logic capacity of older FPGA families and the lack of tools and methodology for attacking a problem of this magnitude. A compiled-code, partial-evaluation approach for timing simulation (lower precision than SPICE) using FPGAs was demonstrated in [42], where the processing architecture was customized for each SPICE circuit using fixed-point computation (Compiled Code, Precision). Our FPGA-based approach accelerates the SPICE computation while retaining the accuracy of spice3f5 and developing an economical single-FPGA system for accelerating SPICE. We reuse the idea of the compiled-code methodology pioneered by many previous approaches. We can compose our technique with KLU-based domain-decomposition approaches [37] to scale to even larger problems and system sizes, e.g. multi-FPGA systems. Additionally, we can integrate lower-precision techniques (e.g. table lookup) into our mapping flow to get cumulative benefits.

3 Model Evaluation

In this section, we show how to compile the nonlinear differential equations describing SPICE device models using a high-level, domain-specific framework based on Verilog-AMS. This approach is broadly applicable to other HPC workloads with a dataflow compute kernel, e.g. mathematical expressions which can be evaluated in parallel in dataflow fashion. We sketch a hypothetical fully spatial design that distributes the complete Model-Evaluation computation across the chip as a configured circuit to achieve the highest throughput. We then develop realistic spatial organizations that can be realized on a single FPGA using statically scheduled time-multiplexing of FPGA resources. This allows us to use less area than the fully spatial design while still achieving high performance. Our automated compilation and tuning approach can scale the implementation to larger system sizes when they become available.

Table 3 Diode Verilog-AMS equations (left), dataflow graph (right)

module diode (a, c);
  parameter real is=10f;
  parameter real vj=0.3;
  inout a, c;
  electrical a, c;
  branch (a, c) ac;
  I(ac) <+ is*(exp(V(ac)/vj) - 1);

(Dataflow graph: V(ac) and vj feed a divide, followed by exp, a subtract-one, and a multiply by is, producing I.)

3.1 Structure

As discussed earlier, the Model-Evaluation phase has high data parallelism consisting of thousands of independent device evaluations, each requiring hundreds of floating-point operations. Additionally, we make other structural observations that will help simplify and enhance our FPGA mapping. We note that there is limited diversity in the number of nonlinear device types in a simulation (e.g. typically only diode and transistor models). There is high pipeline parallelism within each device evaluation as the operations can be represented as an acyclic feed-forward dataflow graph (DAG) with nodes representing operations and edges representing dependencies between the operations. These DAGs are static graphs that are known entirely in advance and do not change during the simulation, enabling efficient offline scheduling of instructions. Individual device instances are predominantly characterized by constant parameters (e.g. Vth, Temperature, Tox) that are determined by the CMOS process, leaving only a handful of parameters that vary from device to device (e.g. W, L of the device). This specialization potential, in the form of constant folding, identity simplification and other compiler optimizations, can eliminate 70–80% of repeated, unnecessary work.

We compile the device equations from a high-level domain-specific language called Verilog-AMS [26] which is more amenable to parallelization and optimization than the existing C description in spice3f5. We show a simple code example for the diode in Table 3. In contrast to Verilog-AMS, the spice3f5 C descriptions make extensive use of pointers into shared data structures that are harder to analyze and do not provide a clean way to separate variables from constants. The Verilog-AMS compilation also allows us to capture the device equations in an intermediate form suitable for performance optimizations and parallel mapping to several target architectures. We use open-source Verilog-AMS nonlinear models from Simucad, ranging from the small, simple diode model to the large, complex bsim3 and psp models.

Table 4 Device model instruction counts

Instruction distribution (optimized)
Model      Add    Mult.  Divide  Sqrt.  Exp.  Log.  Rest
bjt         22      30      17      0     2     0      8
diode        7       5       4      0     1     2      9
jfet        13      31       2      0     2     0      8
mos1        24      36       7      1     0     0     21
vbic        36      43      18      1    10     4      9
mos3        46      82      20      4     3     0     38
hbt        112      57      51      0    23    18     60
bsim4      222     286      85     16    24     9    137
bsim3      281     629     120      9     8     1    117
mextram    675   1,626     397     22    52    37    238
psp      1,345   2,319     247     30    19    10    263

(Rest includes mux, bool and integer operations)

Our Verilog-AMS compiler generates a generic feed-forward dataflow graph of the computation (see the diode example in Table 3) that is processed by the backend tools. The compiler currently performs simple dead-code elimination, mux-conversion, constant-folding, identity-simplification and common-subexpression elimination optimizations. We tabulate the optimized instruction counts for the different device models in Table 4.
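Two of these passes can be sketched on a toy expression DAG; the node encoding and the `optimize` helper below are invented for illustration, not the compiler's actual intermediate form:

```python
# Constant folding and common-subexpression elimination (CSE) over a tiny
# expression DAG; nodes are ('const', value) or ('op', fn_name, arg_names).
import operator

def optimize(exprs):
    fns = {"add": operator.add, "mul": operator.mul}
    values, canon, out = {}, {}, {}
    for name, node in exprs.items():
        if node[0] == "const":
            values[name] = node[1]
            out[name] = node
            continue
        _, fn, args = node
        if all(a in values for a in args):          # constant folding
            values[name] = fns[fn](*(values[a] for a in args))
            out[name] = ("const", values[name])
            continue
        key = (fn, tuple(canon.get(a, a) for a in args))
        if key in canon:                            # CSE: reuse earlier node
            canon[name] = canon[key]
        else:
            canon[key] = canon[name] = name
            out[name] = ("op", fn, key[1])
    return out

exprs = {
    "two":  ("const", 2.0),
    "four": ("op", "mul", ("two", "two")),   # folds to a constant
    "t1":   ("op", "add", ("x", "four")),    # "x" is a runtime input
    "t2":   ("op", "add", ("x", "four")),    # duplicate of t1, removed by CSE
}
opt = optimize(exprs)
```

On real device models, separating per-process constants from per-device variables is what lets passes like these eliminate the bulk of the repeated work.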

3.2 Fully Spatial Architecture

A spatial circuit implementation of computation is a straightforward embodiment ofa dataflow graph on an FPGA. Such a circuit contains physical operators for everyinstruction in the dataflow graph and uses physical wires to implement dependenciesbetween the instructions. These operators can evaluate in parallel and communicateresults directly using the programmable FPGA interconnect. Furthermore, if thecomputation is data parallel, we can exploit pipeline parallelism by adding a suitablenumber of registers along the wires to balance dataflow. This will then permit usto start a new evaluation of the dataflow graph in each cycle and deliver resultsof the computation after the pipeline latency of the graph. This pipelined, spatialcircuit implementation of data-parallel computation will deliver the highest possibleperformance for our Model-Evaluation computation. In contrast, a conventionalvon-Neumann architecture (e.g. Intel CPUs) will implement this computation by

Page 14: High-Performance Computing Using FPGAs || Accelerating the SPICE Circuit Simulator Using an FPGA: A Case Study

402 N. Kapre and A. DeHon

Table 5 Estimated speedup (vs. Intel Core i7 965) and FPGA costs (Virtex-6 LX760) of multi-FPGA designs

Device model   Total speedup   FPGAs required   Speedup per FPGA
bjt                  14               1                14
diode                34               1                34
jfet                 17               1                17
mos1                 14               1                14
vbic                 17               1                17
mos3                 12               1                12
hbt                  62               3                20
mextram             204              18                11
bsim3                47               6                 8
bsim4                69               4                17
psp                 155              21                 7

fetching a binary representation of the computation stored in memory. The binary implicitly encodes the dataflow structure using a sequence of instructions that communicate results using registers (i.e. memory). The dataflow parallelism hidden in this implicit encoding must be rediscovered by the von-Neumann architecture in hardware, often limiting the amount of parallelism that can be exploited from the dataflow graph.
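The register balancing and pipeline-latency calculation described for the fully spatial design can be sketched in a few lines; the operator latencies below are assumed placeholders for illustration, not measured FPGA numbers.

```python
# Sketch: pipelining a feed-forward dataflow graph for 1 evaluation/cycle.
# Operator latencies here are illustrative, not the chapter's actual values.
from collections import defaultdict

def pipeline_depths(ops, edges, latency):
    """ops: node -> op type; edges: (src, dst) pairs; latency: type -> cycles.
    Returns node -> cycle at which its result is ready. Any input that arrives
    earlier than a node's slowest input must be delayed with that many
    balancing registers on its wire."""
    preds = defaultdict(list)
    for s, d in edges:
        preds[d].append(s)
    ready = {}
    def visit(n):
        if n not in ready:
            arrive = max((visit(p) for p in preds[n]), default=0)
            ready[n] = arrive + latency[ops[n]]
        return ready[n]
    for n in ops:
        visit(n)
    return ready

lat = {"in": 0, "add": 8, "mul": 10}   # assumed floating-point latencies
ops = {"a": "in", "b": "in", "m": "mul", "s": "add"}
edges = [("a", "m"), ("b", "m"), ("m", "s"), ("a", "s")]
depth = pipeline_depths(ops, edges, lat)
# The a->s wire needs 10 balancing registers to match the a->m->s path.
print(depth["s"])
```

Once every edge is padded this way, a new graph evaluation can be issued each cycle while results emerge after the longest-path latency.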

Ideal Mapping We can imagine implementing the data-parallel operations in Model-Evaluation as a pipelined dataflow circuit on the FPGA. If cost is not a concern, this approach provides up to two orders of magnitude speedup over an implementation using an Intel Core i7 965 microprocessor when using Xilinx Virtex-6 LX760 FPGAs (see Table 5). We compute a lower bound on the number of FPGAs required to implement the dataflow graph based on total operator area (ignoring FPGA external IO limitations and pipelining area costs). This model provides a lower bound on cost and an upper bound on the speedup possible with the spatial approach. For the designs that fit in a single FPGA, this model only needs to be refined with pipelining costs and can avoid the complexities of the multi-FPGA distribution. A single-FPGA, fully spatial implementation of all devices will eventually be possible with the increasing FPGA densities made possible by Moore's Law. From Table 5, the bsim4 model currently requires only 4 Virtex-6 LX760 FPGAs to fit. This means an FPGA that is 4× denser will fit the complete device evaluation graph. This FPGA will become possible two technology nodes into the future at 22 nm (Virtex-6 is manufactured at the 40 nm technology node).
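The lower-bound cost model amounts to dividing total operator area by one FPGA's capacity. A sketch follows; the per-operator areas and device capacity are assumptions for illustration, not the chapter's calibrated numbers.

```python
import math

# Illustrative sketch of the lower-bound cost model: sum the area of every
# operator in the dataflow graph and divide by one FPGA's capacity, ignoring
# I/O limits and pipelining overheads. Areas and capacity are assumed values.
OP_AREA = {"add": 800, "mul": 1200, "div": 3000, "sqrt": 2500,
           "exp": 7000, "log": 7000}          # hypothetical slices per operator
FPGA_CAPACITY = 118_560                        # hypothetical slices per FPGA

def fpgas_required(op_counts):
    area = sum(OP_AREA[op] * n for op, n in op_counts.items())
    return max(1, math.ceil(area / FPGA_CAPACITY))

# A hypothetical device with a bsim3-like operator mix:
print(fpgas_required({"add": 281, "mul": 629, "div": 120,
                      "sqrt": 9, "exp": 8, "log": 1}))
```

Refining the model with pipelining registers and inter-FPGA I/O constraints can only increase this count, which is why it is a lower bound on cost.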

3.3 Custom VLIW Architecture

In the previous section, we saw how fully spatial implementations (circuit-style implementations of dataflow graphs) are too large to fit on current FPGAs. Hence, computation must be time-shared over limited FPGA resources. These graphs contain a diverse set of floating-point operators such as adds, multiplies, divides, square-roots, exponentials, and logarithms. We map these graphs to custom VLIW processing tiles with spatial implementations of the floating-point operators.

Fig. 9 Custom VLIW organization

Pipelined, spatial FPGA implementations of elementary functions like exp and log operate at a high throughput of one evaluation/cycle (250 MHz) while the processor implementations require 100s of instructions (10–20 cycles at 3 GHz) [17]. Additionally, we support these spatial operators by coupling them to local, distributed, high-bandwidth memories, as shown in Fig. 9, which is not possible with fixed-function CPUs or GPUs. We statically schedule these resources offline in VLIW [12] fashion and perform loop-unrolling, tiling and software pipelining optimizations to improve performance.

Table 6 Parallel software environments

Arch.         Compiler                             Libraries                    Timing
Intel CPUs    gcc-4.4.3 (-O3)                      OpenMP 3.0 [7]               PAPI 4.0.0 [33]
Xilinx FPGA   Synplify Pro 9.6.1, Xilinx ISE 10.1  Coregen [43], Flopoco [8]    –

Each tile in the time-shared architecture consists of a heterogeneous set of floating-point operators coupled to local, high-bandwidth memories and interconnected to other operators through a communication network. We use a time-multiplexed fat tree [23] to connect these operators, which allows us to tune the interconnect bandwidth to match communication requirements between the operators. A time-multiplexed switch in this architecture consists of a multiplexer for each IO port with a small context memory for storing the routing instruction, i.e. the select bits for the multiplexers. Thus, a VLIW instruction for the complete tile is a combination of read/write addresses for local on-chip memories, address control bits, datapath multiplexer controls and switch route decisions for each statically scheduled cycle of operation. We develop a Verilog-AMS compiler for the nonlinear device models that is capable of recognizing a suitable subset of the language specification while performing useful optimizations such as constant-folding, if-mux conversion, and dead-code elimination. Our VLIW scheduling framework first chooses an operator mix per tile proportional to the frequency of occurrence of floating-point operations in the graph generated by the Verilog-AMS compiler. For example, a bsim3 VLIW tile contains 1 add, 3 multiplies and 1 each of divide, sqrt, log and exp operators while a bsim4 tile contains 2 adds, 2 multiplies and 1 each of divide, sqrt, log and exp. We then partition the floating-point operations across the heterogeneous set of operators using MLPart [3]. Finally, we assign each operation to a specific cycle on the datapath and perform a static route on the time-multiplexed network using a greedy LPT (longest processing time first) scheduler to generate the VLIW instruction context. For a nonlinear device model, we configure the FPGA with multiple tiles in SIMD-like (Single Instruction Multiple Data) manner where each VLIW instruction is broadcast to all tiles.
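The operator-mix selection and greedy LPT assignment described above can be sketched as follows; the operator counts, task durations, and seven-operator tile budget are invented for illustration (the real flow partitions with MLPart and schedules operations cycle-by-cycle onto the time-multiplexed network).

```python
# Sketch of the two scheduling decisions described in the text:
# (1) pick a per-tile operator mix proportional to op-type frequency,
# (2) assign operations to operator instances greedily, longest first (LPT).
from collections import Counter
import heapq

def operator_mix(op_types, budget):
    """Allocate `budget` operator instances proportional to op frequency,
    guaranteeing at least one instance of every type that occurs."""
    freq = Counter(op_types)
    total = sum(freq.values())
    return {op: max(1, round(budget * n / total)) for op, n in freq.items()}

def lpt_schedule(durations, instances):
    """LPT list scheduling: longest ops first onto the least-loaded instance.
    Returns the makespan (static schedule length) in cycles."""
    loads = [0] * instances
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

ops = ["add"] * 10 + ["mul"] * 30 + ["div"] * 5 + ["exp"] * 5
mix = operator_mix(ops, budget=7)             # e.g. a 7-operator tile
span = lpt_schedule([10] * 30, mix["mul"])    # 30 multiplies, 10 cycles each
print(mix, span)
```

LPT is a simple stand-in here; a production scheduler must also honor the dataflow dependencies and network route conflicts mentioned above.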

3.4 Experimental Setup

In our experimental flow, we compare the performance of the Intel Core i7 965 quad-core CPU with a Xilinx V6 LX760 FPGA. We measure runtime averaged across thousands of device evaluations. We map the data-parallel model equations to the CPU and FPGA with the help of the software frameworks listed in Table 6. To target these architectures we use a combination of automated code-generation and auto-tuning to generate optimized implementations across these different systems. Our code-generator writes out multiple configurations of parallel code for the CPU and the FPGA-VLIW architecture based on architecture-specific templates. For the CPU, we perform loop-unrolling and generate vector instructions for certain operations to optimize performance. For the FPGA, we can choose the number of datapaths (PEs) and the richness of the BFT interconnect between these datapaths (PEs) to tune performance. Our auto-tuner exhaustively explores several implementation parameters, e.g. loop unroll factor, for the different architectures as shown in Table 7. We also show the range of possible values taken by these parameters as well as the increment step for the exploration. Such an exhaustive exploration is possible in our case since the Model Evaluation graphs are completely known in advance and the design space is small. This framework is also capable of targeting GPUs and other multi-core devices [21].
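The exhaustive auto-tuner loop can be sketched as below; the parameter ranges mirror the FPGA rows of Table 7, while the cost function is a made-up stand-in for an actual timing measurement of generated code.

```python
# Sketch of the exhaustive auto-tuner: enumerate every configuration in the
# (small) design space and keep the fastest. The cost model is a stand-in;
# the real flow times generated code on the CPU or the FPGA schedule.
from itertools import product

def autotune(space, measure):
    """space: dict of parameter -> iterable of values; measure: cfg -> time."""
    names = list(space)
    best_cfg, best_t = None, float("inf")
    for values in product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        t = measure(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

fpga_space = {
    "unroll": range(1, 16),                 # 1-15, +1
    "ops_per_pe": [8, 16, 32, 64],          # 8-64, x2
    "rent_p": [i / 10 for i in range(11)],  # 0.0-1.0, +0.1
}
# Toy cost model: deeper unrolling helps, richer interconnect costs area.
toy = lambda c: 100 / c["unroll"] + c["ops_per_pe"] / 8 + 20 * c["rent_p"]
cfg, t = autotune(fpga_space, toy)
print(cfg)
```

With 15 × 4 × 11 = 660 points, brute force is cheap, which is exactly why the small, fully known design space makes exhaustive exploration practical here.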


Table 7 Auto-tuning parameters

Architecture   Parameter            Range        Increment
Intel          Loop-unroll factor   1–5          +1
               Threads              1–8          +1
               MKL vector           True/false
FPGA           Loop-unroll factor   1–15         +1
               Operators per PE     8–64         ×2
               BFT rent parameter   0.0–1.0      +0.1

Fig. 10 Speedups for model-evaluation (vs. Core i7 965; geomean 6.5×)

3.5 Results

In Fig. 10, we compare the performance achieved by Model-Evaluation implementations between a quad-core Intel Core i7 965 (loop-unrolled and multi-threaded) and a Xilinx Virtex-6 LX760 FPGA (loop-unrolled, tiled and statically scheduled). We observe speedups between 1.4× and 23× (geomean 6.5×) across our nonlinear device model benchmarks. We deliver these speedups due to higher utilization of statically-scheduled floating-point resources (up to 70%), explicit routing of graph dependencies over physical interconnect and spatial implementation of elementary floating-point functions. The FPGA is able to achieve higher speedups for smaller, simpler nonlinear devices, e.g. diode, bjt, since they require smaller interconnect switch programming contexts and a lower memory footprint to store the intermediate values in the evaluation.

3.6 Future Work

We now identify additional opportunities for improving the performance of the parallel FPGA design of the Model-Evaluation phase.


1. Double-precision floating-point operators consume a large amount of area on FPGAs. Custom floating-point or fixed-point operators that operate at just enough precision might provide an opportunity for improving the compute density on FPGAs. We can redesign the Model-Evaluation datapaths with lower precision by adapting existing techniques [29] to obtain additional speedup. We demonstrate some preliminary results using this technique for small devices [30].

2. Additionally, we can improve the performance of the Model-Evaluation phase with extra loop-unrolling and the use of off-chip memory capacity.

4 Sparse Matrix Solve

In Sect. 2, we identified the Matrix-Solve phase of the SPICE circuit simulator as the most challenging phase for parallelization. Large-scale, sparse matrix factorization is also a commonly-used HPC kernel. The computation is characterized by sparse, irregular operations that are too fine-grained to be effectively exploited on conventional architectures (e.g. multi-cores). In this section, we show how to parallelize Sparse Matrix Solve using a combination of the KLU algorithm (better software) and an efficient dataflow FPGA architecture (better hardware). We start by introducing the KLU algorithm that extracts the exact compute graph of matrix factorization and then describe a dataflow architecture for efficiently mapping the compute graph to an FPGA.

4.1 Structure

The SPICE simulator spice3f5 assembles the sparse circuit left-hand side (LHS) matrix and the right-hand side (RHS) vectors in Ax = b using the MNA approach. Since circuit elements tend to be connected to only a few other elements, the MNA circuit matrix is highly sparse (except for high-fanout nets like power lines, etc.). The underlying nonzero structure of the matrix is defined by the topology of the circuit and consequently remains unchanged throughout the duration of the simulation. As discussed earlier, the KLU matrix solver performs a one-time partial pivoting at the start of the simulation to deliver a static compute graph that can be efficiently distributed and evaluated in parallel. This static pattern is reused across all SPICE iterations and enables us to generate a specialized static dataflow architecture that processes the graph in parallel. The KLU Gilbert–Peierls algorithm has irregular, fine-grained task parallelism during LU factorization. We will now look at the pseudocode for the computation to understand the nature of parallelism available in the algorithm.
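The MNA assembly mentioned above can be sketched for purely resistive elements; node numbers and conductance values are illustrative, and real SPICE stamps also cover capacitors, inductors and sources. The key point is that each element writes to matrix positions fixed by the circuit topology, so the nonzero pattern never changes across iterations.

```python
# Sketch of MNA assembly: each two-terminal element "stamps" its conductance
# into A at positions determined only by its node numbers (node 0 = ground),
# so the sparsity pattern of A is fixed for the whole simulation.
def stamp_conductance(A, n1, n2, g):
    if n1:
        A[(n1, n1)] = A.get((n1, n1), 0.0) + g
    if n2:
        A[(n2, n2)] = A.get((n2, n2), 0.0) + g
    if n1 and n2:
        A[(n1, n2)] = A.get((n1, n2), 0.0) - g
        A[(n2, n1)] = A.get((n2, n1), 0.0) - g

A = {}                              # sparse dict-of-coordinates matrix
stamp_conductance(A, 1, 2, 0.5)     # 2-ohm resistor between nodes 1 and 2
stamp_conductance(A, 2, 0, 1.0)     # 1-ohm resistor from node 2 to ground
print(sorted(A.items()))
```

Only the numeric values change per iteration (as device linearizations are refreshed), which is what lets KLU reuse one symbolic analysis for the whole run.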



1  %-------------------------------
2  % input:  sparse matrix A
3  % output: factored L and U
4  %-------------------------------
5  L = I;             % I = identity matrix
6  for k = 1:N
7    b = A(:,k);      % kth column of A
8    x = L \ b;       % \ is Lx=b solve
9    U(1:k) = x(1:k);
10   L(k+1:N) = x(k+1:N) / U(k,k);
11 end;

Listing .1 Gilbert-Peierls Algorithm (A=LU)

1  %-------------------------------
2  % input:  matrix L (1:k-1)
3  % output: kth column of L
4  %-------------------------------
5  x = b;
6  % symbolic analysis predicts non-zeros
7  for i = 1:k-1 where x(i) != 0
8    for j = i+1:N where L(j,i) != 0
9      x(j) = x(j) - L(j,i)*x(i);
10   end;
11 end;
12 % returns x as result

Listing .2 Sparse L-Solve (Lx=b,x=unknown)
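For readers who want to run the listings, a dense-matrix Python rendering is sketched below (without pivoting, and with the sparsity guards reduced to skips over explicit zeros); the real KLU solver operates on sparse structures and predicts the nonzeros symbolically.

```python
# Executable dense rendering of Listings .1/.2: left-looking Gilbert-Peierls
# LU without pivoting. Column k is produced by a lower-triangular solve
# against the already-computed columns of L, then normalized.
def gp_lu(A):
    N = len(A)
    L = [[float(i == j) for j in range(N)] for i in range(N)]
    U = [[0.0] * N for _ in range(N)]
    for k in range(N):                     # slide column k left to right
        x = [A[i][k] for i in range(N)]    # b = A(:,k)
        for i in range(k):                 # sparse L-solve (Listing .2)
            if x[i] != 0:
                for j in range(i + 1, N):
                    if L[j][i] != 0:
                        x[j] -= L[j][i] * x[i]
        for i in range(k + 1):             # U(1:k) = x(1:k)
            U[i][k] = x[i]
        for i in range(k + 1, N):          # L(k+1:N) = x(k+1:N)/U(k,k)
            L[i][k] = x[i] / U[k][k]
    return L, U

A = [[4.0, 1.0], [2.0, 3.0]]
L, U = gp_lu(A)
print(L, U)   # L times U reconstructs A
```

The two `if ... != 0` tests correspond to the `where` guards of Listing .2; in KLU they never execute at runtime because symbolic analysis has already enumerated the nonzeros.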

In Listing .1, we illustrate the key steps of the factorization algorithm. It is the Gilbert–Peierls [13] left-looking algorithm that factors the matrix column-by-column from left to right (shown in the figure accompanying Listing .1 by the sliding column k). For each column k, we must perform a sparse lower-triangular matrix solve shown in Listing .2. The algorithm exploits knowledge of the nonzero positions of the factors when performing this sparse lower-triangular solve (the x(i) != 0 checks in Listing .2). This feature of the algorithm reduces runtime by only processing nonzeros and is made possible by the early symbolic analysis phase. It stores the result of this lower-triangular solve step in x (Line 8 of Listing .1). The kth column of the L and U factors is computed from x after a normalization step on the elements of Lk. Once all columns have been processed, the L and U factors for that iteration are ready. From the pseudo-code in Listings .1 and .2 it may appear that the matrix solve computation is inherently sequential. However, we can visualize this computation as a dataflow graph by unrolling the loops from the code listings. We show the dataflow graph corresponding to a small example matrix in Table 8. From the dataflow graph, we observe that there are two forms of parallel structure in the Matrix-Solve factorization computation that we can exploit in our parallel design: (1) factorization of independent columns organized into parallel subtrees and (2) fine-grained dataflow parallelism within each column. We now describe our parallel architecture capable of exploiting this parallelism.

Table 8 Sparse circuit matrix (left) and dataflow graph for LU factorization (right)
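The loop unrolling that turns Listing .2 into an explicit dataflow graph can be sketched as follows; the naming scheme and the example nonzero pattern are invented for illustration.

```python
# Sketch: unrolling the sparse L-solve loops of Listing .2 into an explicit
# dataflow graph. Each nonzero L(j,i) yields one multiply and one subtract
# node; the operand names record the value dependencies that the parallel
# architecture must deliver as tokens.
def unroll_column(k, L_nonzeros):
    """L_nonzeros: list of (j, i) positions with L[j][i] != 0, i < k.
    Returns dataflow ops as (result, op, operands) triples for column k."""
    ops = []
    for i in range(k):
        for (j, ii) in L_nonzeros:
            if ii == i and j > i:
                ops.append((f"t{j}_{i}", "mul", (f"L{j}{i}", f"x{i}")))
                ops.append((f"x{j}", "sub", (f"x{j}", f"t{j}_{i}")))
    return ops

# Column 3 of a matrix whose L factor has nonzeros at (2,1) and (3,2):
for node in unroll_column(3, [(2, 1), (3, 2)]):
    print(node)
```

Nodes with no dependence on each other (updates from disjoint columns) can fire concurrently, which is exactly the independent-subtree and within-column parallelism identified above.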

4.2 Token-Dataflow Architecture

The Sparse Matrix-Solve computation can be represented as a sparse, irregular dataflow graph that is fixed at the beginning of the simulation. We recognize that static online scheduling of this parallel structure may be infeasible due to the prohibitively large size of these sparse matrix factorization graphs (millions of nodes and edges, where nodes are floating-point operations and edges are dependencies). Hence, we organize our architecture as a dynamically scheduled Token Dataflow [36] machine. This organization is capable of exploiting parallelism across a sparse, irregular graph with fully decentralized, distributed control. We automatically generate the dataflow graphs for LU factorization as well as the Front/Back-Solve steps from symbolic analysis and evaluation of the sparse factorization in the KLU solver as shown in Fig. 18.

Fig. 11 Custom dataflow organization

Our parallel FPGA architecture, shown in Fig. 11, consists of multiple interconnected Processing Elements (PEs), each holding hundreds to thousands of graph nodes. We partition our graph across the PEs, thereby assigning several nodes to each PE in our parallel architecture. We place nodes on the PEs using MLPart [3] to exploit locality. Each PE can fire a node dynamically based on a fine-grained dataflow triggering rule. This allows parallel evaluation of multiple graph nodes which have received all their inputs as computation proceeds down the graph. The Dataflow Trigger in the PE keeps track of ready nodes and issues operations when the nodes have received all inputs. Tokens of data representing dataflow dependencies are routed between the PEs over a packet-switched network. Each switch in the network is assembled using simple split and merge blocks as described in [23]. The switches implement dimension-ordered routing (DOR [11]) on a Bidirectional Mesh topology. The Send Logic in the PE injects messages into the network for nodes that have already been processed. For very large graphs, we partition the graph and perform static prefetch of the subgraphs from external DRAM. This is possible since the graph is completely feed-forward. We show the performance possible with this architecture in Sect. 7.
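The dataflow triggering rule can be sketched in software; the toy "operation" below just sums its inputs, whereas the real PEs perform floating-point operations and route result tokens over the packet-switched network in parallel.

```python
# Minimal sketch of the dataflow triggering rule: a node fires once it has
# received tokens on all of its inputs, and its result token is forwarded
# to every consumer. A real machine does this concurrently across PEs.
from collections import deque

def run_dataflow(nodes, inputs):
    """nodes: name -> list of input names; inputs: name -> initial value."""
    consumers = {n: [] for n in list(nodes) + list(inputs)}
    for n, ins in nodes.items():
        for i in ins:
            consumers[i].append(n)
    pending = {n: len(ins) for n, ins in nodes.items()}
    values, ready = dict(inputs), deque(inputs)
    while ready:                       # tokens in flight
        src = ready.popleft()
        for dst in consumers[src]:
            pending[dst] -= 1
            if pending[dst] == 0:      # trigger rule: all inputs arrived
                values[dst] = sum(values[i] for i in nodes[dst])  # toy op
                ready.append(dst)
    return values

g = {"n1": ["a", "b"], "n2": ["b", "c"], "n3": ["n1", "n2"]}
print(run_dataflow(g, {"a": 1, "b": 2, "c": 3}))
```

Note that n1 and n2 become ready independently; on the FPGA they would fire on different PEs in the same cycle, which is the fine-grained parallelism the Token Dataflow organization exploits.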

4.3 Experimental Framework

In our Matrix-Solve experimental flow, we compare the performance of the optimized CPU implementation with the FPGA dataflow architecture. We first use the spice3f5 simulator with its Sparse 1.3 [24] solver to obtain a reference functional implementation for comparison. We then replace Sparse 1.3 with the new KLU solver to measure optimized sequential performance. This forms our optimized CPU baseline for performance comparison. For the FPGA mapping, we perform a cycle-accurate simulation of the Token Dataflow architecture. For large graphs that do not fit in the on-chip memories, we account for graph loading times over a DDR3 memory interface. We report cycle counts from the simulation to compute speedups. We use a rich and diverse set of benchmark circuit-simulation matrices detailed in the Appendix.

Fig. 12 Speedups for matrix-solve (vs. Core i7 965); 3.2× geomean

4.4 Evaluation

When we integrate the KLU matrix solver in spice3f5 instead of the default Sparse 1.3 solver, we are able to speed up the software implementation by ≈35% across our benchmark circuits. We achieve higher improvements for larger benchmarks since the symbolic analysis overheads can be amortized easily for large matrices. We use this as our software baseline for comparing with the FPGA implementation. In Fig. 12, we compare the performance of our FPGA architecture implemented on a Virtex-6 LX760 with an Intel Core i7 965. We observe speedups of 0.6–6.5× (geomean 3.2×) for the 25-PE FPGA mapping that devotes all FPGA resources to Matrix-Solve acceleration over a range of benchmark matrices. For the complete SPICE system, we can only fit a 9-PE system for Matrix-Solve as discussed in Sect. 6. Our FPGA implementation allows efficient processing of the fine-grained factorization operations, which can be synchronized at the granularity of individual floating-point operations. To better understand the speedups, we plot the distribution of parallel runtime across the different steps of the Matrix-Solve implementation in Fig. 13. We observe that performance is dominated by the cost of loading the large dataflow graph from off-chip memory. We may be able to reduce this overhead with better DRAM memory interfaces and higher on-chip capacity.


Fig. 13 Parallel runtime distribution for matrix-solve (% of runtime in LU Factorization, Front-Solve, Back-Solve and Memory Load)

4.5 Future Work

We now identify some ideas for achieving higher performance in the parallel Matrix-Solve design.

1. We need to explore newer domain decomposition [37] and associative reformulation [18] strategies for improved scalability of the bottleneck Sparse Matrix-Solve phase of SPICE. With domain decomposition, we can break up the large matrix into multiple submatrices that can be solved independently and possibly even distributed across multiple FPGAs.

2. Sparse matrix solve operations on large matrices can generate large dataflow graphs with millions of nodes and edges. These large graphs are challenging to distribute across multiple PEs. We can accelerate the placement algorithm itself using parallelism to minimize the one-time setup cost of the parallel simulation.

3. Apart from these approaches, it may be useful to consider completely different algorithms (iterative matrix-free fixed-point simulation [6] or constant-Jacobian [47]) for SPICE simulations that completely eliminate the need for performing per-iteration matrix factorization.

5 Iteration Control

In Sects. 3 and 4, we discussed the two computationally intensive phases of the SPICE simulator. In this section, we explain how to implement the sequential, control-intensive SPICE state-machines. We caution against mapping this sequential Iteration Control computation to a lightweight embedded microcontroller (e.g. Xilinx Microblaze, Altera NIOS) as it creates a performance bottleneck and decreases overall speedups. This is broadly true for any parallel application (including HPC problems) with a minimal degree of sequential control that could become a performance bottleneck. We discuss a streaming approach that permits a high-level expression of the Iteration Control computation using the SCORE [4] framework. Our FPGA organization uses a combination of static and dynamic scheduling to deliver balanced speedups for the integrated design.

5.1 SCORE Framework

We express the SPICE Iteration Control algorithms in a stream-based framework called SCORE [4] (Stream Computation Organized for Reconfigurable Execution). The SCORE programming model allows us to capture the SPICE iteration control algorithm at a high level of abstraction and permits exploration of different implementation configurations for the parallel SPICE solver. The streaming abstraction naturally matches the processing structure of the control algorithms and the overall composition of the solver. However, the SCORE compute model was originally designed for rapidly reconfigurable, time-multiplexed FPGAs. Modern FPGAs offer poor dynamic reconfiguration support and are unsuitable for the coarse-grained, dynamically reconfigurable implementation of SCORE. Consequently, we develop a new implementation model for SCORE based on resource sharing and static scheduling. We adapt the backend flow from our Model-Evaluation infrastructure described in Sect. 3 to support dataflow graphs generated from the SCORE description of the Iteration Control computation.

SCORE allows description of streaming applications using dynamic dataflow. A SCORE program consists of a graph of operators (compute) and segments (memory) linked to each other via streams (interconnect). Computation within an operator is described as a finite-state machine (FSM). The operations within a state can be described as a dataflow graph, while the state machine transitions are captured using a state transition graph. This suits the control-intensive nature of the SPICE iteration control algorithm.
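A SCORE-style operator, i.e. an FSM that consumes tokens from input streams and emits tokens on an output stream, might be sketched as below; the "converge" behavior, block size and tolerance are illustrative, not the actual spice3f5 convergence test.

```python
# Sketch of a SCORE-style streaming operator: a small FSM that consumes a
# stream of per-node voltage deltas and emits one boolean verdict per
# iteration. States: "accumulate" (implicit) until n tokens arrive, then
# emit and reset. Names and tolerance are illustrative.
def converge(deltas, n, tol=1e-6):
    """Stream operator: fold n per-node deltas into one convergence token."""
    count, ok = 0, True
    for d in deltas:                 # input stream: |V_new - V_old| values
        ok = ok and (abs(d) < tol)
        count += 1
        if count == n:               # state transition: emit and reset
            yield ok
            count, ok = 0, True

# Two simulated iterations over a 3-node circuit: the first has one large
# delta (not converged), the second has all small deltas (converged).
verdicts = list(converge([1e-9, 2e-9, 1e-3, 1e-9, 1e-8, 1e-9], n=3))
print(verdicts)
```

Because the operator only holds a counter and a flag between tokens, it maps naturally onto the streaming, pipelined evaluation that SCORE's operator/segment/stream decomposition is designed for.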

We show the high-level SCORE representation of the SPICE Iteration Controller in Fig. 14. We describe the control algorithms as SCORE operators and state-machines interconnected by streams. The stream connections allow pipelined, parallel evaluation of the different operators when possible. The white nodes in Fig. 14 represent the state-machine and breakpoint logic. For calculating convergence and local truncation error (LTE), we stream voltages, currents and charges through the operation graph for the respective equations. The gray nodes are the data-parallel stateless nodes that calculate LTE and compute convergence as a function of the voltage x, current b and charge Q vectors. We represent the Model-Evaluation and Sparse Matrix-Solve phases of SPICE as black boxes. Internally these are implemented differently using the FPGA organizations described earlier.

Fig. 14 High-level SCORE operator graph for spice3f5

In Table 9 we show the number of floating-point instructions and their types in the different SCORE operators. These statistics are obtained from the optimized operation graphs generated by tdfc, the SCORE compiler. As expected, we observe that the If-Mux, Comparison, and Boolean instructions constitute the bulk of the control-intensive computation in this phase of SPICE. We also note that we need only one SQRT floating-point operation and no other expensive elementary floating-point functions. In Table 10, we show the dynamic activation counts for the different SCORE operators in the Iteration Control phase of SPICE. An activation is when a state within that SCORE operator gets fired. We observe that the LTE and Convergence calculations dominate the dynamic activation counts.

Table 9 SCORE compiler optimized instruction counts for iteration control

Operator     Add   Mult.  Divide  Sqrt.  If-Mux  Cmp.  Bool  Rest  Total
converge       7       1       0      0       6     5     1     0     20
LTE           16       8       9      1      21    20     0     0     75
breakpoint    95       2       1      0     110    76    35    11    330
nistmc         2       0       0      0       8     7     5     2     24
spicestmc     29      15       6      0      79    42    24    17    212
Total        149      26      16      1     224   150    65    32    513

Column Rest includes floor, ceiling, and other special functions

Table 10 SCORE operator activation frequency for a simple resistor–capacitor–diode circuit

Operator     Total activations/iteration   Percent of total
converge     1,088,465                     64.394
LTE          601,076                       35.560
accept       299                           0.017
breakpoint   48                            0.002
nistmc       152                           0.009
spicestmc    262                           0.015


Fig. 15 Hybrid VLIW organization for iteration control

5.2 Hybrid VLIW Architecture for Iteration Control

Traditionally, FPGA designs offload the sequential control portion of a spatial design either to host CPUs or embedded Microblaze [45] controllers. Such techniques are unsuitable for stand-alone accelerator systems (no host CPU) or double-precision floating-point computation (poor support on Microblaze). Hence, we consider spatial designs that can implement this computation in the FPGA fabric directly. As discussed earlier, we observe that the computation is a combination of (1) data-parallel convergence detection and truncation error calculation and (2) sparsely activated, control-intensive SPICE analysis state-machine logic. We express this parallel structure using the streaming SCORE [4] framework and compile this parallelism to a hybrid VLIW architecture. The underlying FPGA architecture is organized as tiles (one tile is shown in Fig. 15) interconnected through streams. Each tile is a collection of floating-point operators (limited to add, multiply, divide and square-root) that are internally connected with a time-multiplexed network. Each operator is managed by a hybrid controller that dynamically selects between statically-scheduled configurations. The spatial mapping flow combines loop-unrolled, software-pipelined scheduling for data-parallel components like the truncation error calculation and convergence detection logic along with dataflow scheduling for sparsely activated state-machine logic. The hybrid VLIW architecture is mostly similar to the Model-Evaluation design and we reuse its backend scheduling framework. The difference in this architecture is the support for limited dynamic processing. The data-parallel convergence detection and truncation error estimation operations, and the sparsely activated individual state computations in the SPICE analysis state-machines, are statically scheduled in VLIW fashion to exploit dataflow parallelism. In contrast, the loop control state-machine transition operations are evaluated dynamically using a spatial implementation of state transitions, making this a hybrid VLIW design that combines static and dynamic scheduling.

Fig. 16 Speedups for iteration-control (vs. Core i7 965; 1.8× geomean)

5.3 Experimental Framework

We compare the performance of different partitioning strategies for implementing the Iteration Controller. We evaluate (1) CPU-FPGA (PCIe), (2) Microblaze-FPGA logic, and (3) our proposed hybrid VLIW-FPGA logic partitionings. For the CPU backend, we generate multi-threaded C++ code from the SCORE compiler [4, 9]. This also allows us to perform a functional software verification with spice3f5. We use PAPI to measure the CPU runtime of the Iteration Control phase. For the Microblaze backend, we develop a SCORE runtime customized for the Microblaze soft processor that enables stream operations. This is done through automated code generation in a flavor of C suitable for use with an embedded operating system running on the Microblaze (Xilkernel [46]). We use a hardware counter to measure the Microblaze clock cycles. For the FPGA-VLIW mapping, we develop a code-generation backend for SCORE that uses the scheduler developed for the Model-Evaluation phase described previously in Sect. 3. We report cycle counts from the FPGA-VLIW scheduler.

5.4 Results

In Fig. 16, we show a 1.8× speedup for the Iteration Control phase of SPICE when comparing a spatial implementation with a host CPU-offload implementation. This suggests that a stand-alone FPGA accelerator execution can deliver better performance. We consider the impact of parallelizing the Iteration Control phase on the overall speedups of the FPGA accelerator. In Fig. 17, we show the overall SPICE speedups under three implementation scenarios: (1) offload to a sequential host CPU over PCI, (2) offload to a Microblaze soft-processor, and (3) spatial implementation over the hybrid VLIW design. We observe that the spatial implementation can deliver modest improvements of 2.8× (geomean) over the sequential CPU implementation. We can show this benefit by localizing all communication within the FPGA system and exploiting data parallelism in the convergence detection and truncation error calculation steps. However, the amount of overall improvement is not very high since the Iteration Control phase accounts for merely ≈7% of sequential SPICE runtime. Other FPGA studies [38] prefer to implement such sequential fractions of the application on embedded soft-processors like the Xilinx Microblaze. We see the limits of using the Microblaze (1.6× geomean speedup) to implement this sequential computation as it can be worse than even offloading the processing to the host CPU over PCI (1.9× geomean speedup). The Microblaze soft-processor offers poor double-precision floating-point support and schedules computation sequentially over the ALU, thereby limiting potential performance. In contrast, the spatial VLIW design exploits the available data parallelism and implements the state-machine processing with lightweight decision-making hardware, thus delivering better performance.

Fig. 17 Speedup for the overall SPICE simulator for different iteration control implementations (HybridVLIW 2.8×, Sequential 1.9×, Microblaze 1.6× geomean)
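The sensitivity of overall speedup to this ≈7% sequential fraction follows directly from Amdahl's law. The quick check below is illustrative: the 6.5× and 3.2× phase geomeans come from earlier sections, but the 66/27/7 runtime split is an assumed example, not the chapter's measured breakdown.

```python
# Amdahl's-law check of why Iteration Control matters: even a small
# sequential fraction caps overall speedup if it is not accelerated.
def overall_speedup(fractions_speedups):
    """fractions_speedups: list of (fraction of runtime, phase speedup)."""
    return 1.0 / sum(f / s for f, s in fractions_speedups)

# Illustrative split: Model-Evaluation 66%, Matrix-Solve 27%, control 7%.
phases = [(0.66, 6.5), (0.27, 3.2)]
print(overall_speedup(phases + [(0.07, 1.8)]))   # spatial control
print(overall_speedup(phases + [(0.07, 0.5)]))   # slowed-down control hurts
```

The two printed values differ noticeably, which is the quantitative version of the caution above: a control implementation slower than the CPU drags down the whole accelerator.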

5.5 Future Work

We now sketch a few ideas for improving the performance of the Iteration Control FPGA design.

1. We can overlap the Model-Evaluation phase with the Sparse Matrix-Solve phase of SPICE. The overlap scheduler needs to statically compute a suitable ordering of the device evaluations in Model-Evaluation to match the dataflow ordering in Matrix-Solve.


2. Both Xilinx and Altera have announced FPGA platforms that are closely coupled with fast sequential cores, e.g. the Xilinx-ARM Zynq platform and the Altera-Intel Atom platform. We need to reexamine the hardware–software partitioning problem for these platforms to determine whether our hybrid VLIW architecture continues to offer better scalability.

6 FPGA Implementation Methodology

We now explain the complete methodology and framework for mapping and running SPICE simulations on an FPGA. As shown in Fig. 18, at a high level, the SPICE user provides a SPICE netlist for simulation acceleration. Our automated FPGA mapping flow first selects a logic bitstream (marked (a) in Fig. 18) based on the type of nonlinear device model being used (e.g. bsim3 or bsim4) and the kind of SPICE analysis requested (e.g. DC or Transient analysis). We pre-compile a handful of


Fig. 18 High-level FPGA SPICE usage flow


Fig. 19 FPGA SPICE mapping toolflow

FPGA logic bitstreams for the different nonlinear models and simply choose the right bitstream without invoking the expensive FPGA CAD flow at runtime. We pick an appropriate mix of nonlinear device configurations as demanded by the circuit, with constant-folding of common model parameters (e.g. parameters specific to the CMOS process like Tox and Vth0) to optimize this configuration. We then exploit the circuit structure to build the sparse matrix and extract the dataflow graph through a one-time static analysis (marked (b) in Fig. 18). We then assemble the device-specific constants (e.g. W and L of the transistors) and the sparse dataflow graph into a memory image for the DRAM (marked (c) in Fig. 18). Finally, we configure the FPGA and run the simulation without any CPU intervention. We can read back the simulation results from the FPGA for post-processing and analysis. At present, we are required to fit all phases of SPICE on the FPGA because dynamic reconfiguration is too slow to be useful for our benchmark circuits (reconfiguration alone takes 1–2 ms, compared to a few milliseconds of FPGA SPICE iteration time). We show the complete FPGA mapping flow in greater detail in Fig. 19 and cross-label the key steps from Fig. 18. The mapping flow is organized into different paths customized for each specific SPICE phase.
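The bitstream-selection step (a) amounts to a lookup into a small library of pre-compiled configurations, keyed by device model and analysis type, so the expensive CAD flow never runs at simulation time. A minimal sketch of this idea; the dictionary contents, file names, and function name are illustrative assumptions, not from the chapter:

```python
# Hypothetical pre-compiled bitstream library: one entry per
# (nonlinear device model, analysis type) pair. File names are invented.
PRECOMPILED = {
    ("bsim3", "transient"): "spice_bsim3_tran.bit",
    ("bsim3", "dc"):        "spice_bsim3_dc.bit",
    ("bsim4", "transient"): "spice_bsim4_tran.bit",
    ("bsim4", "dc"):        "spice_bsim4_dc.bit",
}

def select_bitstream(model: str, analysis: str) -> str:
    """Pick a pre-compiled logic bitstream without invoking FPGA CAD."""
    try:
        return PRECOMPILED[(model.lower(), analysis.lower())]
    except KeyError:
        raise ValueError(f"no pre-compiled bitstream for {model}/{analysis}")

print(select_bitstream("bsim4", "transient"))  # -> spice_bsim4_tran.bit
```

Because the library is enumerated offline over the handful of supported models and analyses, runtime configuration reduces to this constant-time lookup.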

In Fig. 20 we show how the different portions of the FPGA fabric co-operate to execute a SPICE simulation. In Step (1), we configure the FPGA with a suitable logic bitstream and download the memory image onto the DRAM to set up the simulation. In Step (2), we stream the device parameters through the Model-Evaluation VLIW circuit to process each device and generate the current (right-hand-side vector) and conductance (matrix A) contributions, while also checking for convergence in Iteration Control. These contributions are inputs to the Matrix-Solve phase in Step (3). Next, in Step (4), the dataflow graphs are streamed through the token dataflow architecture from the off-chip memory to solve for the unknowns (vector x), along with a convergence check in Iteration Control. The voltage vector (x) is


Fig. 20 FPGA SPICE execution flow

input for the next iteration of Model-Evaluation, as shown in Step (5). The Iteration Controller runs the analysis state-machines that advance or terminate the simulation as appropriate.

6.1 Offline Logic Configuration

We generate the logic for implementing the VLIW, Dataflow, and Streaming architectures by choosing an appropriate balance of area and memory resources through an area-time trade-off analysis. In Fig. 21, we show area-time trade-offs for the different phases of SPICE and pick a feasible configuration by rapidly evaluating the small space of possible configurations. We mark the feasible configurations in Fig. 21 and show the exact resource utilization of these feasible points in Table 11. For the composite design, we are restricted to an 8-tile VLIW engine for Model-Evaluation, a 6-tile VLIW engine for Iteration Control, and a 3×3 tile architecture for Matrix-Solve. The FPGA logic configuration includes the VLIW programming for the PEs and switches of the Model-Evaluation and Iteration Control blocks.
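The trade-off analysis above can be sketched as an exhaustive search over the small configuration space: discard points that exceed the device's slice budget, then take the fastest survivor. The candidate tuples below are illustrative numbers, not measured data from the chapter:

```python
# Minimal sketch of the area-time trade-off search: enumerate the small
# space of candidate configurations, keep those that fit, pick the fastest.

def pick_feasible(configs, slice_budget):
    """configs: iterable of (name, slices, normalized_time).
    Return the fastest configuration that fits the slice budget."""
    feasible = [c for c in configs if c[1] <= slice_budget]
    if not feasible:
        raise ValueError("no configuration fits the device")
    return min(feasible, key=lambda c: c[2])  # fastest feasible point

candidates = [
    ("4-tile",   35_000, 1.00),  # illustrative (name, slices, time)
    ("8-tile",   62_512, 0.55),
    ("16-tile", 120_000, 0.40),  # fastest, but exceeds the budget below
]
print(pick_feasible(candidates, slice_budget=70_000))  # -> 8-tile point
```

Because each phase has only a handful of tile-count options, a brute-force sweep like this is cheap and avoids any heuristic search machinery.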


[Figure: normalized time (0–1.2) vs. area in slices (10^3–10^5) for Iteration Control, Model-Evaluation, and Sparse Matrix-Solve, with the Virtex-6 capacity marked]

Fig. 21 FPGA SPICE area-time trade-offs

Table 11 FPGA resource distribution for complete SPICE solver (Virtex-6 LX760, bsim4 model)

SPICE phase          Area (Slices, %)   Memory (BRAMs, %)   DSPs (48E1, %)
Model-Evaluation      62,512   53         448   62            176   20
Sparse Matrix-Solve   27,090   23         180   25             99   11
Iteration Control     17,848   15          32    5             77    9
Total                107,450   91         660   92            352   40

6.2 Runtime Memory Configuration

For each circuit, we must program memory resources to store the circuit-specific variables and data structures relevant to the simulation. This is primarily necessary to support the circuit-specific matrix factorization graph required for the Sparse Matrix-Solve phase. For the nonlinear devices and independent sources, we store the device-specific constant parameters from the circuit netlist in FPGA on-chip memory, or in off-chip DRAM if necessary. We load a few simulation control parameters (e.g. abstol, reltol, final time) to help the Iteration Control phase declare convergence and termination of the simulation. We also need to generate a static dataflow graph for the Matrix-Solve phase at the start of the simulation through symbolic analysis. We distribute the sparse dataflow graph across the Matrix-Solve processing elements (shown by the "Graph Placement" block in Fig. 19) and store the graph in off-chip DRAM when it does not fit in on-chip capacity. We compute a static ordering of loads from the off-chip memory to appropriately stream the graph structure on-chip. Once we have the dataflow graphs, we assign nodes to the PEs of our parallel architecture using placement for locality with MLPart [3].
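The chapter performs this node-to-PE assignment with the MLPart hypergraph partitioner; as a much simpler stand-in to illustrate the goal (keeping connected nodes on the same PE to localize communication), the hedged sketch below buckets nodes in BFS order. This is not the actual placement algorithm, only an approximation of its locality objective:

```python
from collections import deque

def place_nodes(adj, n_pes):
    """Assign dataflow-graph nodes to PEs, keeping BFS-adjacent nodes
    together. adj: dict node -> list of neighbors. Returns node -> PE id."""
    order, seen = [], set()
    for root in adj:                      # cover every connected component
        if root in seen:
            continue
        q = deque([root]); seen.add(root)
        while q:                          # breadth-first traversal
            n = q.popleft(); order.append(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m); q.append(m)
    per_pe = -(-len(order) // n_pes)      # ceil(nodes / PEs)
    return {n: i // per_pe for i, n in enumerate(order)}

# Two components: a 4-node chain and a 2-node pair, split across 2 PEs.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(place_nodes(adj, 2))
```

A real partitioner additionally balances load and minimizes cut edges across PE boundaries, which is why MLPart is used in the actual flow.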


Table 12 Area and latency model for SPICE hardware (Virtex-6 LX760); the multiply block uses 11 DSP48 units

Block               Area (slices)   Latency (clocks)   Speed (MHz)   Ref.
Double-precision floating-point operators
Add                     334               8                344       [43]
Multiply                131              10                294       [43]
Divide                1,606              57                277       [43]
Square root             822              57                282       [43, 44]
Exponential           1,022              30                200       [8]
Logarithm             1,561              30                200       [8]
Network elements
TM BFT T-Switch          48               2                300       [23, 31]
TM BFT Pi-Switch         64               2                300       [23, 31]
PS Mesh Switch          642               4                312       –
Switch-Switch            32               2                300       –
Processing elements and miscellaneous
VLIW Tile Ctrl.          82               –                300       –
Dataflow PE Ctrl.       297               –                270       –
Microblaze Ctrl.      1,504               –                100       –
DDR2 Ctrl.            1,892               –                250       [32]

6.3 Hardware Library and Cost Model

We tabulate the resource requirements and performance characteristics of the compositional hardware elements in Table 12. We use spatial implementations of the individual floating-point add, multiply, divide, and square-root operators from the Xilinx Floating-Point library in CoreGen [44]. For the exponential and logarithm operators we use FPLibrary from the Arenaire group [8]. For the Model-Evaluation and Iteration Control architectures, we interconnect the operators using a time-multiplexed butterfly-fat-tree (BFT) network that routes 64-bit doubles (or 32-bit floats when considering a single-precision implementation) through time-multiplexed switches. For the Matrix-Solve architecture, we interconnect the floating-point operators using a bidirectional mesh packet-switched network that routes 84-bit, 1-flit packets (64-bit double plus 20-bit node address) using dimension-ordered routing (DOR). We use a hardware generation framework to automatically generate structural VHDL code for the system based on selected implementation parameters such as system size, network topology, and network bandwidth. The software infrastructure that supports time-multiplexed scheduling and packet-switched simulation is extended to provide this hardware generation functionality. We store the static schedules as read-only constants in local on-chip distributed memories.
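To see how the Table 12 latencies feed a cost model, consider estimating the pipeline depth of one small Model-Evaluation expression, e.g. the diode current i = Is·(exp(v/vt) − 1). Summing operator latencies along the critical path is our simplification for illustration; the actual scheduler accounts for operator sharing and network hops as well:

```python
# Operator latencies (clocks) for double-precision blocks, from Table 12.
LATENCY = {
    "add": 8, "multiply": 10, "divide": 57, "sqrt": 57,
    "exp": 30, "log": 30,
}

def path_latency(ops):
    """Total latency of operators chained along one critical path."""
    return sum(LATENCY[op] for op in ops)

# v/vt -> exp -> subtract 1 (an adder) -> multiply by Is
print(path_latency(["divide", "exp", "add", "multiply"]))  # 57+30+8+10 = 105
```

Even this toy path shows why the divide (57 cycles) dominates: expressions in the device models are restructured where possible to keep divides off the critical path.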


Cycles = max(T_modeleval + T_matsolve, T_iterctrl(dp)) + T_iterctrl(stmc)

where
T_modeleval      = VLIW Model-Evaluation cycles
T_matsolve       = Dataflow Matrix-Solve cycles
T_iterctrl(dp)   = Data-Parallel Iteration-Control cycles
T_iterctrl(stmc) = State-Machine Iteration-Control cycles

Fig. 22 Measuring FPGA cycle count

6.4 FPGA Cycle Measurement

We express the total number of cycles required by our FPGA implementation as shown in Fig. 22. This model assumes we must fit all three phases of the SPICE iteration on the FPGA simultaneously. The model also assumes that part of the Iteration Control phase overlaps with the other two phases of SPICE. In our model, we only consider the execution time of the SPICE iterations and exclude the initial simulation setup time (e.g. circuit parsing, matrix construction, matrix static analysis). This setup cost is small (usually proportional to 1–2 SPICE iteration times [20]) and is common to both the sequential CPU and parallel FPGA implementations. It is easily amortized over the sufficiently large number of iterations for which the FPGA-SPICE accelerator is expected to be used.
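The composite cycle model of Fig. 22 is direct to express as a function: the data-parallel part of Iteration Control runs concurrently with Model-Evaluation plus Matrix-Solve, while the state-machine part is serialized after them. The example inputs are illustrative numbers, not measurements from the chapter:

```python
def spice_iteration_cycles(t_modeleval, t_matsolve,
                           t_iterctrl_dp, t_iterctrl_stmc):
    """Cycles per SPICE iteration (Fig. 22): the data-parallel
    Iteration-Control work overlaps the other two phases; the
    state-machine work is added sequentially."""
    return max(t_modeleval + t_matsolve, t_iterctrl_dp) + t_iterctrl_stmc

# Illustrative: Iteration Control is fully hidden whenever its
# data-parallel phase is shorter than Model-Evaluation + Matrix-Solve.
print(spice_iteration_cycles(10_000, 25_000, 8_000, 1_500))  # -> 36500
```

The max() term captures why parallelizing Iteration Control yields only modest overall gains: its data-parallel cost is usually hidden behind the two dominant phases, leaving only the serialized state-machine term exposed.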

We report cycle counts from the time-multiplexed schedule (Model-Evaluation and Iteration Controller) and a cycle-accurate simulation (Matrix-Solve). We estimate memory load time for large matrices using streaming loads over the external DDR2-500 MHz memory interface with lower-bound bandwidth calculations. For our speedup calculation, we consider graph loading times as well as vector and device-constant loading times from external memory.
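The lower-bound bandwidth calculation above reduces to bytes streamed divided by peak DRAM bandwidth. In the sketch below, the 8 GB/s peak (a 64-bit DDR2 interface at 500 MHz) and the 16 bytes per graph edge are our assumptions for illustration; the chapter only states the interface speed:

```python
def load_time_us(n_edges, bytes_per_edge=16, peak_gbps=8.0):
    """Lower-bound time (microseconds) to stream a dataflow graph of
    n_edges from DRAM at the assumed peak bandwidth."""
    total_bytes = n_edges * bytes_per_edge
    return total_bytes / (peak_gbps * 1e9) * 1e6

# A graph with ~515.5K operations (the size of the r4k clocktree
# factorization in Table 13) streams in about a millisecond.
print(round(load_time_us(515_500), 1))  # -> 1031.0 microseconds
```

Since real DRAM traffic incurs row-activation and refresh overheads, this is strictly a lower bound; the static load ordering computed at setup time helps approach it by keeping accesses sequential.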

7 Evaluation

We report the achieved performance and energy requirements of our parallel SPICE implementation. In Fig. 23a, we compare SPICE runtime on an Intel Core i7 965 with a Virtex-6 LX760 FPGA across benchmark circuits of increasing size. We observe a geomean speedup of 2.8× across our benchmark set, with a peak speedup of 11× for the largest benchmark. We also show the ratio of energy consumption between the two architectures in Fig. 23b. We estimate power consumption of the FPGA using the Xilinx XPower tool, assuming 20% activity on the flip-flops, on-chip memory ports, and external IO ports. Under this model, the FPGA consumes up to 40.9× (geomean 8.9×) lower energy than the microprocessor implementation. To first order, the observed speedup and energy benefits are proportional to the size of the benchmark. Larger benchmarks admit greater parallelism across multiple independent devices for the Model-Evaluation phase. Regular circuits with low fanout and high locality also generate good parallelism for the Sparse Matrix-Solve phase. Thus, the variations in acceleration


[Figure: (a) per-benchmark speedup (1–13×, 2.8× mean); (b) per-benchmark energy savings (0–50×, 8.9× mean)]

Fig. 23 Comparing Xilinx Virtex-6 LX760 FPGA (40 nm) and Intel Core i7 965 (45 nm) implementations. (a) Total per-chip speedup; (b) energy ratio

can be explained in terms of the size and structure of the circuits across our benchmark set. We expect this accelerator to be particularly useful for speeding up SPICE simulations of large circuits (millions to billions of transistors), where sequential implementations can take days or weeks of runtime. Our performance scaling trends in Fig. 23a suggest a favorable increase in speedup with increasing circuit size.

8 Conclusions

We show how to use FPGAs to accelerate the SPICE circuit simulator by up to an order of magnitude while also delivering an order of magnitude energy reduction, comparing a Xilinx Virtex-6 LX760 with an Intel Core i7 965. We were able to deliver these speedups by exposing available parallelism in all phases of SPICE using a high-level, domain-specific framework and customizing FPGA hardware to match the nature of parallelism in each phase. We were able to compose the overall


Table 13 Circuit simulation benchmark matrices

Bmarks.     Matrix size  Sparsity (%)  Total ops.  Fanout   Fanin   Latency (cycles)
Simucad [39]
mux8             42        15.0793         626         8       20       1.9K
ringosc         104         6.4903        1.6K         4       92       3.7K
dac             654         1.5849       23.6K        10    1,136       7.7K
Clocktrees [40]
r4k1         39,948         0.0131      515.5K         6   29,910     127.8K
Wave-pipelined Interconnect [41]
10stages      3,920         0.1753       72.7K         8    2,384      18.6K
20stages     11,225         0.0618      219.2K         9    9,442      46.2K
30stages     16,815         0.0410      306.0K        11    4,688      88.6K
40stages     22,405         0.0307      395.7K         9      600     134.2K
50stages     27,995         0.0245      493.9K        10      484     169.7K
ISCAS89 Netlists [2]
s27             189         3.4405        2.1K         6       50       3.6K
s208          1,296         0.5277       19.7K        11    1,414      11.3K
s298          1,801         0.4026       32.6K        13    1,938      13.1K
s344          1,992         0.3522       32.3K        12    2,178      14.7K
s349          2,017         0.3512       33.9K        14    2,218      14.7K
s382          2,219         0.3184       37.2K        16    2,358      16.1K
s444          2,409         0.2952       41.4K        16    2,526      16.6K
s386          2,487         0.2927       46.4K        20    2,626      15.7K
s510          2,621         0.3124      105.3K        54    2,722      21.4K
s526n         3,154         0.2362       66.1K        25    3,280      21.9K
s526          3,159         0.2376       68.1K        26    3,294      20.7K
s641          3,740         0.2000      100.2K        39    4,066      26.5K
s713          4,040         0.1890      126.4K        47    4,380      30.3K
s820          4,625         0.1655      103.2K        29    4,766      26.1K
s832          4,715         0.1629      105.7K        29    4,846      26.6K
s953          4,872         0.1876      353.9K        85    5,212      37.9K
s1196         6,604         0.1399      475.3K        83    7,146      46.4K
s1238         6,899         0.1325      457.9K        78    7,454      46.6K
s1423         9,304         0.0820      296.0K        64   10,384      64.5K
s1488         9,849         0.0827      354.7K        49   10,606      54.8K
s1494         9,919         0.0817      352.4K        50   10,646      54.6K

heterogeneous design that mixes VLIW, Dataflow, and Streaming organizations into a unified implementation with the assistance of a suitable SCORE composition framework. The tools and techniques we developed for mapping SPICE to FPGAs are general and applicable to a broader range of designs. We believe the ideas explored in this research are relevant across an important class of problems where computation is characterized by static, data-parallel processing and where the algorithm


operates on sparse, irregular data structures. Such high-level approaches based on exploiting spatial parallelism will become important for improving the performance and energy-efficiency of general-purpose computation.

Appendix

We show the matrix characteristics of the circuit benchmarks used in our experiments in Table 13. We use RAM netlists (Simucad [39]), clocktrees (University of Michigan [40]), wave-pipelined circuits (UBC [41]), and the ISCAS 1989 benchmark set (IBM [2]).

References

1. A.M. Bayoumi, Y.Y. Hanafy, Massive parallelization of SPICE device model evaluation on GPU-based SIMD architectures, in Proceedings of the 1st International Forum on Next-Generation Multicore/Manycore Technologies, Cairo, Egypt (ACM, New York, 2008), pp. 1–5
2. F. Brglez, D. Bryan, K. Kozminski, Combinational profiles of sequential benchmark circuits. IEEE Int. Symp. Circ. Syst. 3, 1929–1934 (1989)
3. A. Caldwell, A. Kahng, I. Markov, Improved algorithms for hypergraph bipartitioning, in Proceedings of the 2000 Asia and South Pacific Design Automation Conference (2000), pp. 661–666
4. E. Caspi, Design Automation for Streaming Systems. Ph.D. thesis, University of California, Berkeley, 2005
5. C.-W. Ho, A. Ruehli, P. Brennan, The modified nodal approach to network analysis. IEEE Trans. Circ. Syst. 22(6), 504–509 (1975)
6. B. Conn, XPICE Circuit Simulation Software (unpublished) (2008)
7. L. Dagum, R. Menon, OpenMP: an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
8. F. de Dinechin, J. Detrey, O. Cret, R. Tudoran, When FPGAs are better at floating-point than microprocessors, in Proceedings of the International ACM/SIGDA Symposium on Field-Programmable Gate Arrays (ACM, New York, 2008), p. 260
9. A. DeHon, Y. Markovsky, E. Caspi, M. Chu, R. Huang, S. Perissakis, L. Pozzi, J. Yeh, J. Wawrzynek, Stream computations organized for reconfigurable execution. Microprocess. Microsyst. 30(6), 334–354 (2006)
10. M. DeLorimier, N. Kapre, N. Mehta, D. Rizzo, I. Eslick, R. Rubin, T.E. Uribe, T.F.J. Knight, A. DeHon, GraphStep: a system architecture for sparse-graph algorithms, in IEEE Symposium on Field-Programmable Custom Computing Machines (IEEE, Piscataway, 2006), pp. 143–151
11. J. Duato, S. Yalamanchili, N. Lionel, Interconnection Networks: An Engineering Approach (Morgan Kaufmann, Los Altos, 2002)
12. J.A. Fisher, The VLIW machine: a multiprocessor for compiling scientific code. IEEE Comput. 17(7), 45–53 (1984)
13. J. Gilbert, T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations. SIAM J. Sci. Stat. Comput. 9(5), 862–874 (1988)


14. K. Gulati, J.F. Croix, S.P. Khatri, R. Shastry, Fast circuit simulation on graphics processing units, in Proceedings of the Asia and South Pacific Design Automation Conference (IEEE, Piscataway, 2009), pp. 403–408
15. J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 2nd edn. (Morgan Kaufmann, Los Altos, 1996)
16. S. Hutchinson, E. Keiter, R. Hoekstra, H. Watts, A. Waters, R. Schells, S. Wix, The Xyce parallel electronic simulator – an overview, in IEEE International Symposium on Circuits and Systems (IEEE, Piscataway, 2000)
17. Intel, Intel Math Kernel Library 10.2.5.035 (Intel, 2005)
18. N. Kapre, A. DeHon, Optimistic parallelization of floating-point accumulation, in IEEE Symposium on Computer Arithmetic (IEEE Computer Society, Washington, DC, 2007), pp. 205–216
19. N. Kapre, A. DeHon, Accelerating SPICE model-evaluation using FPGAs, in IEEE Symposium on Field Programmable Custom Computing Machines (IEEE, New York, 2009), pp. 37–44
20. N. Kapre, A. DeHon, Parallelizing sparse matrix solve for SPICE circuit simulation using FPGAs, in International Conference on Field-Programmable Technology (IEEE, Piscataway, 2009), pp. 190–198
21. N. Kapre, A. DeHon, Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors, in International Conference on Field Programmable Logic and Applications (IEEE, Piscataway, 2009), pp. 65–72
22. N. Kapre, A. DeHon, VLIW-SCORE: beyond C for sequential control of SPICE FPGA acceleration, in International Conference on Field-Programmable Technology (IEEE, Piscataway, 2011)
23. N. Kapre, N. Mehta, M. DeLorimier, R. Rubin, H. Barnor, M. Wilson, M. Wrighton, A. DeHon, Packet switched vs. time multiplexed FPGA overlay networks, in IEEE Symposium on Field-Programmable Custom Computing Machines (IEEE, Piscataway, 2006), pp. 205–216
24. K.S. Kundert, A. Sangiovanni-Vincentelli, Sparse User's Guide: A Sparse Linear Equation Solver (1988)
25. P. Lee, S. Ito, T. Hashimoto, J. Sato, T. Touma, G. Yokomizo, A parallel and accelerated circuit simulator with precise accuracy, in Proceedings of the 2002 Asia and South Pacific Design Automation Conference (IEEE, Piscataway, 2002), pp. 213–218
26. L. Lemaitre, G. Coram, C. McAndrew, K. Kundert, Extensions to Verilog-A to support compact device modeling, in Proceedings of the Behavioral Modeling and Simulation Conference (IEEE, Piscataway, 2003), pp. 7–8
27. D. Lewis, A programmable hardware accelerator for compiled electrical simulation, in Proceedings of the 25th ACM/IEEE Design Automation Conference (IEEE, Piscataway, 1988), pp. 172–177
28. D. Lewis, A compiled-code hardware accelerator for circuit simulation, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE, Piscataway, 1992), pp. 555–565
29. M. Linderman, M. Ho, D. Dill, T. Meng, G. Nolan, Towards program optimization through automated analysis of numerical precision, in Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (ACM, New York, 2010), pp. 230–237
30. H. Martorel, N. Kapre, FX-SCORE: a framework for fixed-point compilation of SPICE device models using Gappa++, in IEEE Symposium on Field Programmable Custom Computing Machines (IEEE, Piscataway, 2012)
31. N. Mehta, Time-Multiplexed FPGA Overlay Networks on Chip. Master's thesis, California Institute of Technology, 2006
32. Microsoft Research, DDR2 DRAM Controller for BEE3 (Microsoft Research, 2008)
33. P. Mucci, S. Browne, C. Deane, G. Ho, PAPI: a portable interface to hardware performance counters, in Proceedings of the Department of Defense High Performance Computing Modernization Program Users Group Conference (IEEE Computer Society, Washington, DC, 1999), pp. 7–10


34. L.W. Nagel, SPICE2: A Computer Program to Simulate Semiconductor Circuits. Ph.D. thesis, University of California, Berkeley, 1975
35. E. Natarajan, KLU: A High Performance Sparse Linear Solver for Circuit Simulation Problems. Master's thesis, University of Florida, Gainesville, 2005
36. G. Papadopoulos, D. Culler, Monsoon: an explicit token-store architecture. Proc. Annu. Int. Symp. Comput. Archit. 18(3a), 82–91 (1990)
37. H. Peng, C.K. Cheng, Parallel transistor level circuit simulation using domain decomposition methods, in Proceedings of the Asia and South Pacific Design Automation Conference (IEEE, Piscataway, 2009), pp. 397–402
38. A. Putnam, S. Eggers, D. Bennett, E. Dellinger, J. Mason, H. Styles, P. Sundararajan, R. Wittig, Performance and power of cache-based reconfigurable computing, in Proceedings of the International Symposium on Computer Architecture, vol. 37 (ACM, New York, 2009), p. 395
39. Simucad/Silvaco, BSIM3, BSIM4 and PSP benchmarks from Simucad (Simucad (now Silvaco), 2007)
40. C. Sze, P. Restle, G. Nam, C. Alpert, ISPD2009 clock network synthesis contest, in Proceedings of the 2009 International Symposium on Physical Design (ACM, New York, 2009), p. 149
41. P. Teehan, G. Lemieux, M. Greenstreet, Towards reliable 5Gbps wave-pipelined and 3Gbps surfing interconnect in 65nm FPGAs, in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (ACM, New York, 2009), pp. 43–52
42. Q. Wang, D.M. Lewis, Automated field-programmable compute accelerator design using partial evaluation, in Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, 1997, pp. 145–154
43. Xilinx, Xilinx CoreGen Reference Guide, 2100 Logic Drive, San Jose, CA 95124, USA (2000). www.xilinx.com
44. Xilinx, Floating-Point Operator v5.0, 2100 Logic Drive, San Jose, CA 95124, USA (2009). www.xilinx.com
45. Xilinx, MicroBlaze Processor Reference Guide, 2100 Logic Drive, San Jose, CA 95124, USA (2010). www.xilinx.com
46. Xilinx, OS and Libraries Document Collection. Technical report, 2100 Logic Drive, San Jose, CA 95124, USA (2010). www.xilinx.com
47. X. Ye, W. Dong, P. Li, S. Nassif, MAPS: multi-algorithm parallel circuit simulation, in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (IEEE, Piscataway, 2008), pp. 73–78

