Blazing Saddles: Getting the Performance Out of VCS
Gregg D. Lahti, Corrent Corporation
Tim Schneider, Synopsys Corporation
Gopal Varshney, Corrent Corporation
High Noon in ASICville
• It’s two weeks before your chip tape-out deadline:
  – A new bug is found…
  – You must fix the bug and re-run your regression simulations across the compute ranch…
  – And still make the tape-out date…
  – Or you may be the next gateslinger in the manager’s layoff sights at High Noon…
Are you SURE you’re getting the most performance from VCS?
VCS Usage Model
• Consider your VCS usage
  – Large designs yield a large number of tests
    At Corrent, over 2000 total tests for 3M gate and 8M gate ASICs
  – 75% of an engineer’s time is spent debugging the design in specific areas
• VCS needs to be utilized in two modes:
  – Debugging mode, where extra visibility into the simulation is required
  – Regression mode, where performance is required
VCS Debugging Mode
• Useful for point problems
• Dumping signal state takes VCS resources
  – Slows down simulation speed, especially with lots of I/O to disk
  – Usually includes some debugger like Debussy
VCS Regression Mode
• Optimize for speed!
• Usually many tests run in batch mode
• Just verifying pass/fail operation of design
• Debug only the tests that fail, in debugging mode with signal state saving turned on
Items That Kill VCS Performance
• VCS performance can be hampered by command-line switches used without discretion:
  – -I                 The interactive mode
  – -PP                Post-processing of dumpfile
  – +cli               Turns on command-line interactive mode
  – +acc+2             Used for backwards-compatible PLI calls
  – Lack of -Mupdate   Loss of saved-state compile info
  – -P [library]       Compile in libraries that may not be required
Items That Kill VCS Performance (cont)
• Coding styles can kill VCS performance
  – Delay loops or unneeded timing assignments kill performance, as VCS cannot optimize the execution:

    always @(posedge clk or negedge reset_n) begin
      if (~reset_n) q <= #0 0;
      else q <= #1 d;
    end // always

  – #0 and #1 delays can cost as much as 200% in simulation time, and 30-50% on average! [1]

[1] Verilog Nonblocking Assignments With Delays, Myths and Mysteries, Cliff Cummings, Boston SNUG 2002 paper.
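For comparison, here is a minimal sketch of the same flop with the #0 and #1 intra-assignment delays removed. The module name and port list are illustrative and not taken from the Corrent design; the ANSI-style port declarations assume a Verilog-2001 capable simulator.

```verilog
// Sketch only: dff_nodelay and its ports are hypothetical names.
module dff_nodelay (
  input  wire clk,
  input  wire reset_n,
  input  wire d,
  output reg  q
);
  // Same behavior as the flop on the slide, minus the #0/#1 delays.
  always @(posedge clk or negedge reset_n) begin
    if (~reset_n) q <= 1'b0;  // asynchronous, active-low reset
    else          q <= d;     // plain nonblocking assignment, no delay
  end
endmodule
```

If a design relies on the #1 for waveform readability or for hold-time margin in mixed RTL/gate simulation, that dependency should be checked before stripping the delays.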
Items That Kill VCS Performance (cont)
• Use of initial and always blocks
  – The following construct is legal Verilog but may cause VCS to lock up in an infinite loop:

    always begin
      mysig = 1;
    end // always

  – Mixing blocking and nonblocking assignments may cause poor VCS simulation speed and bad logic if coded incorrectly. Not a good idea to mix these!
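An always block with no timing or event control re-enters itself in zero time, so simulation time never advances. The sketch below shows the usual alternatives under that reading; the module and signal names other than mysig are hypothetical.

```verilog
// Sketch only: always_fix_sketch, clk and toggled are hypothetical names.
module always_fix_sketch;
  reg mysig;
  reg clk;
  reg toggled;

  // One-time assignment: an initial block runs once and then ends.
  initial mysig = 1'b1;

  // Free-running clock: the #5 delay lets time advance on every pass.
  initial clk = 1'b0;
  always #5 clk = ~clk;

  // Repeating behavior: the event control suspends the block
  // until the next rising clock edge.
  initial toggled = 1'b0;
  always @(posedge clk) toggled <= ~toggled;

  // Stop the sketch after 100 time units.
  initial #100 $finish;
endmodule
```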
Items That Kill VCS Performance (cont)
• PLI and .tab files
• Disk & Network I/O
  – Optimize your simulations by running on local disk
  – Network filesystems can be 10X slower!
  – Bribe your sysadmin for /tmp space, run as follows:
    -o /tmp/siv -Mdir /tmp/csrc
Items That Improve VCS Performance
• Command line switches
  – +rad    Verilog pre-processor that optimizes code
• Coding styles
  – Remove #0 and #1 delays
  – Separate sequential items from combinatorial processes
• In-line C instead of PLI calls
  – Use the Direct Kernel Interface (DKI)
  – Use the DKI for Debussy
    +vcsd along with the proper vcsd .tab file
    http://www.solvnet.synopsys.com/retrieve/900611.html
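One common way to separate sequential items from combinatorial processes is the two-block style sketched below: next-state logic in a combinational always block with blocking assignments, and the register in a clocked always block with nonblocking assignments. The counter and all of its names are illustrative, not from the Corrent design, and the always @* shorthand assumes a Verilog-2001 capable simulator.

```verilog
// Sketch only: counter_split and its signals are hypothetical names.
module counter_split (
  input  wire       clk,
  input  wire       reset_n,
  input  wire       enable,
  output reg  [3:0] count
);
  reg [3:0] count_next;

  // Combinational process: blocking assignments, default value first
  // so that no latch is inferred.
  always @* begin
    count_next = count;
    if (enable) count_next = count + 4'd1;
  end

  // Sequential process: nonblocking assignments only, no delays.
  always @(posedge clk or negedge reset_n) begin
    if (~reset_n) count <= 4'd0;
    else          count <= count_next;
  end
endmodule
```

Keeping the two kinds of assignment in separate blocks also avoids the blocking/nonblocking mixing warned about earlier in the deck.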
Cleaning the pli.tab File
• PLI tab files can be optimized!
• Look for the following in your .tab files:
    acc:rw,cbka: *
  – The * signifies every signal in your design gets read/write access and visibility by the debugger.
  – Streamline this, as you probably don’t need EVERY signal!
• Debussy ships with a very unoptimized pli.tab file:
  – Example line:
    $fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%*
  – Replace the %* with %TASK; this improves performance by as much as 15-20%
Cleaning the pli.tab File (cont)
• If your home-grown PLIs and debugging PLIs are still dragging down simulation, use the +vcs+learn+pli flag.
  – Run the simulation with this flag turned on
  – VCS figures out which PLI calls are utilized in the design
  – VCS generates a new pli.tab file to be used
• Useful for simulation speed improvements; results vary based on PLI usage
• Caveat emptor:
  – If you change your PLI interface or usage, you need to re-run with the flag or risk an incorrect/failing simulation!
Profiling Simulations
• Profile your simulation to see where the time is spent
• Useful to see if code, PLI or library is causing the bottleneck
• Easy with VCS 5.2 and later:
  – Use +prof in the command-line compile script
  – VCS creates a vcs.prof output file
  – Read the file, see where the time is spent
Profiling Simulations (cont)
• Output of vcs.prof log:

// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 976.180 seconds
======================================================================
TOP LEVEL VIEW
======================================================================
TYPE          %Totaltime
----------------------------------------------------------------------
PLI                 0.23
VCD                 0.99
KERNEL              7.76
DESIGN             91.02

  – The Simulation Time line is the total simulation time
Profiling Simulations (cont)
=====================================================================
MODULE VIEW
=====================================================================
Module(index)          %Totaltime  No of Instances  Definition
---------------------------------------------------------------------
delaychain (1)              67.75               56  ../top/rtl/delaychain.v:15.
dll_delay_line (2)           7.16                2  ../rtl/ddrctlr/rtl/dll_delay_line.v:21.
ckrst (3)                    2.47                1  ../top/rtl/ckrst.v:13.
INVDL (4)                    1.25             8431  /projects/clibs/umc/0.15vst/tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
hurricane_tb (5)             1.23                1  ../tb/hurricane_tb.v:31.
pdisp (6)                    1.16                8  ../rtl/hurricane/rtl/pdisp.v:33.
dll_mux (7)                  0.96             1720  ../rtl/ddrctlr/rtl/dll_mux.v:21.
dll_buf (8)                  0.56             1732  ../rtl/ddrctlr/rtl/dll_buf.v:21.
spsram_1536x32 (9)           0.54                8  /projects/clibs/rams_nobist_M4one/0.15vst_2.0/Verilog_fix/spsram_1536x32.v:8.
Profiling Simulations (cont)
• Ouch! Spending 67% of VCS time in delaychain.v
  – We used this to hand-tune the clock trees from the clock generation block (to get around anemic Apollo clock tree insertion issues).
  – The delaychain.v code looks like this:

    module delaychain (sigin,sigout);
    input sigin;
    output [120:0] sigout;
    wire [120:0] sigout;
    BUFD1 buf_u000 (.A(sigin), .Z(sigout[0]));
    INVDL inv_u001 (.A(sigin), .Z(sigout[1]));
    …
    INVDL inv_u120 (.A(sigout[119]), .Z(sigout[120]));
    endmodule
Profiling Simulations (cont)
• Compiling and running with +nospecify gives us better performance:

// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 149.660 seconds
======================================================================
TOP LEVEL VIEW
======================================================================
TYPE          %Totaltime
----------------------------------------------------------------------
PLI                 1.65
VCD                 0.02
KERNEL             11.86
DESIGN             86.46
Profiling Simulations (cont)
• Better run, but still spending time in delaychain.v

=======================================================================
MODULE VIEW
=======================================================================
Module(index)          %Totaltime  No of Instances  Definition
----------------------------------------------------------------------
delaychain (1)              17.24               56  ../top/rtl/delaychain.v:15.
hurricane_tb (2)             8.43                1  ../tb/hurricane_tb_nodump.v:31.
dll_mux (3)                  5.06             1720  ../rtl/ddrctlr/rtl/dll_mux.v:21.
INVDL (4)                    4.12             8431  /projects/clibs/umc/0.15vst/tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
pdisp (5)                    3.83                8  ../rtl/hurricane/rtl/pdisp.v:33.
dll_buf (6)                  3.10             1732  ../rtl/ddrctlr/rtl/dll_buf.v:21.
rctl (7)                     2.35                8  ../rtl/hurricane/rtl/rctl.v:32.
dll_delay_element (8)        2.24             1720  ../rtl/ddrctlr/rtl/dll_delay_element.v:20.
Profiling Simulations (cont)
• Change the Verilog: ifdef the instanced gates so they are compiled only for synthesis and gate-level simulation

    module delaychain (
      sigin,
      sigout
    );
    input sigin;
    output [120:0] sigout;
    wire [120:0] sigout;
    `ifdef SYNTH_DELAYCHAIN
    // Conditionally compiled in
    BUFD1 buf_u000 (.A(sigin), .Z(sigout[0]));
    INVDL inv_u001 (.A(sigin), .Z(sigout[1]));
    …
    INVDL inv_u120 (.A(sigout[119]), .Z(sigout[120]));
    `else
    // RTL version, VCS can optimize this!
    assign sigout = { sigin, {60{!sigin, sigin}} };
    `endif
    endmodule
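Since the RTL branch replaces 121 gate instances with a single concatenation, it is worth checking that every tap still has the polarity the gate chain produced: even taps follow sigin, odd taps carry its inverse. The sketch below is one way to do that. The names delaychain_rtl and delaychain_tb are hypothetical, and the first module is a copy of the RTL branch from the slide so that the sketch compiles on its own.

```verilog
// Sketch only: module and signal names below are hypothetical.
`timescale 1ns/1ps

// Copy of the RTL branch from the slide, so the sketch stands alone.
module delaychain_rtl (sigin, sigout);
  input          sigin;
  output [120:0] sigout;
  assign sigout = { sigin, {60{!sigin, sigin}} };
endmodule

// Self-checking testbench: even taps should follow sigin,
// odd taps should carry its inverse.
module delaychain_tb;
  reg          sigin;
  wire [120:0] sigout;
  reg          expected;
  integer      k;
  integer      errors;

  delaychain_rtl dut (.sigin(sigin), .sigout(sigout));

  task check_taps;
    begin
      for (k = 0; k <= 120; k = k + 1) begin
        expected = (k % 2) ? ~sigin : sigin;
        if (sigout[k] !== expected) begin
          $display("Mismatch at tap %0d: got %b, expected %b",
                   k, sigout[k], expected);
          errors = errors + 1;
        end
      end
    end
  endtask

  initial begin
    errors = 0;
    sigin = 1'b0; #10 check_taps;
    sigin = 1'b1; #10 check_taps;
    if (errors == 0) $display("PASS: all 121 taps match");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

This only checks logical polarity. The hand-tuned delays exist only in the gate version, so timing still has to be verified in gate-level simulation with SYNTH_DELAYCHAIN defined.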
Profiling Simulations (cont)
• Compiling and running with +nospecify and the new ifdef’ed delaychain.v provides much better performance:

// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 124.410 seconds
======================================================================
TOP LEVEL VIEW
======================================================================
TYPE          %Totaltime
----------------------------------------------------------------------
PLI                 1.15
VCD                 0.02
KERNEL             13.34
DESIGN             85.50
Profiling Simulations (cont)
• The delaychain.v isn’t on the top list of CPU hogs:

=======================================================================
MODULE VIEW
=======================================================================
Module(index)          %Totaltime  No of Instances  Definition
----------------------------------------------------------------
hurricane_tb (1)             6.79                1  ../tb/hurricane_tb_nodump.v:31.
dll_mux (2)                  5.90             1720  ../rtl/ddrctlr/rtl/dll_mux.v:21.
pdisp (3)                    4.92                8  ../rtl/hurricane/rtl/pdisp.v:33.
rctl (4)                     3.76                8  ../rtl/hurricane/rtl/rctl.v:32.
dll_buf (5)                  3.66             1732  ../rtl/ddrctlr/rtl/dll_buf.v:21.
xaux_regs (6)                2.93                8  ../rtl/hurricane/rtl/xaux_regs.v:249.
dll_delay_element (7)        2.63             1720  ../rtl/ddrctlr/rtl/dll_delay_element.v:20.
delaychain (8)               2.08               56  ../top/rtl/delaychain.v:15.
tdc_cdb (9)                  1.67                1  ../rtl/tdc/rtl/tdc_cdb.v:16.
spsram_1536x32 (10)          1.32                8  /projects/clibs/rams_nobist_M4one/0.15vst_2.0/verilog_fix/spsram_1536x32.v:8.
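Because %Totaltime is relative to each run's own total, the delaychain share is easier to compare once converted to absolute time. A rough conversion follows, using only the Simulation Time and delaychain %Totaltime values from the three vcs.prof listings, and assuming the module percentages are fractions of that reported total:

```latex
\begin{align*}
\text{baseline:} \quad & 0.6775 \times 976.180\ \text{s} \approx 661.4\ \text{s} \\
\text{+nospecify:} \quad & 0.1724 \times 149.660\ \text{s} \approx 25.8\ \text{s} \\
\text{+nospecify and ifdef'ed delaychain.v:} \quad & 0.0208 \times 124.410\ \text{s} \approx 2.6\ \text{s}
\end{align*}
```

On that reading, the time attributed to delaychain drops by more than two orders of magnitude across the three runs, even though its percentage only falls from 67.75 to 2.08.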
Speeding Gate Simulations
• Gate-level, back-annotated simulations are really slow, but useful
  – Checks for real-world conditions that STA may have missed
  – Useful for boot/power-up testing (does your chip come out of reset?)
  – Ensure that the layout netlist works as specified
• Good place for simulation speed improvements!
• VCS has two switches that optimize gate-level simulation:
  – +timopt
  – +memopt
Speeding Gate Simulations (cont)
• The +timopt flag:
  – Optimization based on clock signals and sequential devices in design
  – Useful since +rad can’t optimize SDF-annotated designs
  – Used as +timopt+time, where time is the smallest clock period in the design
  – VCS generates a configuration file that shows more optimization that can be done by hand
  – In one Corrent design, 32% of the design was optimized using +timopt
  – Speed improvement varies; our test case measured a 20% improvement
Speeding Gate Simulations (cont)
• +memopt can compress memory structures during compile
  – Useful if gate sims don’t fit into the memory footprint (e.g. the Linux ~3GB process size limitation)
  – May not always work; the process can still overflow the process-size limit
  – Use +memopt+2 to spawn a second child process for compilation
• On Linux, bump the process size limit from the generic 3GB to 3.7GB:
  – Edit /usr/src/linux-2.4/include/asm-i386/page.h
  – Change 0xC0000000 to 0xEC000000
Performance Results
• Corrent increased RTL-based simulation performance by over 6X!
  – Average speed increase measured over 15 different simulation runs of a 15M gate RTL design
  – Incremental changes measured between flag settings
  – Profiling was essential!
• Corrent increased gate-level simulation speed by an average of 22% using +timopt
  – Some simulations were able to fit into 3.7GB of process size with +memopt
Performance Results (cont)
Test Name                       A        B        C        D        E
ahb_cfg_test1              30.770    7.340    7.360    6.650    5.940
ahb_cfg_test2              13.130    4.370    4.380    4.170    3.740
ahb_cfg_test_incr         186.830   33.860   33.780   29.530   25.740
ahb_cfg_wr_rd              67.610   13.980   13.940   12.400   10.980
ahb_ddr_bw                115.140   22.310   22.250   18.970   16.680
ahb_ddr_test1              61.570   13.000   12.940   11.260    9.890
ahb_ddr_test_incr         103.870   21.420   21.270   18.510   16.320
ahb_ddr_wr_rd             100.100   20.520   20.340   17.400   15.400
ahb_memctl_test1          217.370   34.560   34.330   29.610   25.220
ahb_memctl_test_sdram     131.330   23.970   23.900   21.170   18.420
gmi_mission_test1         416.290   76.550   76.510   67.790   59.320
gmi_mission_test2         416.880   76.840   76.940   67.890   59.560
gmi_mission_test3         416.340   76.850   76.630   67.670   59.470
gmi_pause_test1            76.250   15.730   15.700   13.920   12.560
gmi_ser_test               97.630   18.230   18.180   16.160   14.140
Speed Increase over (A)         -     533%     534%     608%     693%

A: Baseline script
B: Removal of instanced gate-level delay chains
C: Remove +acc+2, -I, and -PP switches
D: Remove compiled-in Debussy PLI and other debugging PLIs
E: With +nospecify switch
Performance Results (cont)
• Command line switches alone:

Test Name                       B        C        D        E
ahb_cfg_test1               7.340    7.360    6.650    5.940
ahb_cfg_test2               4.370    4.380    4.170    3.740
ahb_cfg_test_incr          33.860   33.780   29.530   25.740
ahb_cfg_wr_rd              13.980   13.940   12.400   10.980
ahb_ddr_bw                 22.310   22.250   18.970   16.680
ahb_ddr_test1              13.000   12.940   11.260    9.890
ahb_ddr_test_incr          21.420   21.270   18.510   16.320
ahb_ddr_wr_rd              20.520   20.340   17.400   15.400
ahb_memctl_test1           34.560   34.330   29.610   25.220
ahb_memctl_test_sdram      23.970   23.900   21.170   18.420
gmi_mission_test1          76.550   76.510   67.790   59.320
gmi_mission_test2          76.840   76.940   67.890   59.560
gmi_mission_test3          76.850   76.630   67.670   59.470
gmi_pause_test1            15.730   15.700   13.920   12.560
gmi_ser_test               18.230   18.180   16.160   14.140
Speed Increase over (B)         -       0%      14%      30%

B: Removal of instanced gate-level delay chains
C: Remove +acc+2, -I, and -PP switches
D: Remove compiled-in Debussy PLI and other debugging PLIs
E: With +nospecify switch
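The summary rows in the two results tables appear to be ratios of column totals rather than averages of per-test ratios. This is inferred from the numbers and is not stated on the slides. Writing t_i^{(X)} for the time of test i under configuration X, and summing each column over the 15 tests, gives totals of 2451.11 (A), 459.53 (B) and 353.38 (E), so:

```latex
\[
\frac{\sum_i t_i^{(A)}}{\sum_i t_i^{(E)}} = \frac{2451.11}{353.38} \approx 6.94,
\qquad
\frac{\sum_i t_i^{(B)}}{\sum_i t_i^{(E)}} = \frac{459.53}{353.38} \approx 1.30
\]
```

On that reading, the first table quotes the ratio itself (693% is roughly a 6.9X speedup over the baseline), while the second quotes the ratio minus one (E is 30% faster than B), so the two percentage rows are not on the same scale.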
Summary
• Clean the simulation scripts by removing the following:
    -I
    +acc+2
    -P [library] (unused PLI calls)
    -PP
• Add these flags into the simulation scripts:
    -Mupdate -o csrc (use local disk)
    +rad
    +nospecify
    +nbaopt (if required)
Summary (cont)
• Clean your Verilog of #0 and #1 delays
• Optimize your pli.tab files
• Profile your simulation! Nasty time-sinks can be resolved!
References
• VCS 5.0 and 6.0 User Guides, Synopsys Corporation, 2002.
• Test Benches: The Dark Side of IP Reuse, Gregg D. Lahti, San Jose SNUG 2000 paper. http://gateslinger.com/chiphead.htm or http://www.synopsys.com/news/pubs/snug/snug00/lahti_final.pdf
• Verilog Nonblocking Assignments With Delays, Myths and Mysteries, Cliff Cummings, Boston SNUG 2002 paper. http://www.sunburst-design.com/papers/
• ESNUG posts: 380 item 11, 383 item 9, 387 item 16. http://deepchip.com/esnug.html
• Solvnet: http://solvnet.synopsys.com
Special thanks to Mark Warren for the fruit basket and review of the paper!