Performance Profile

Date post: 15-Jan-2016
Blazing Saddles: Getting the Performance Out of VCS

Gregg D. Lahti, Corrent Corporation
Tim Schneider, Synopsys Corporation
Gopal Varshney, Corrent Corporation

High Noon in ASICville

• It’s two weeks before your chip tape-out deadline:
  – A new bug is found…
  – You must fix the bug and re-run your regression simulations across the compute ranch…
  – And still make the tape-out date…
  – Or you may be the next gateslinger in the manager’s layoff sights at High Noon…

Are you SURE you’re getting the most performance from VCS?

VCS Usage Model

• Consider your VCS usage:
  – Large designs yield a large number of tests (at Corrent, over 2000 total tests for 3M-gate and 8M-gate ASICs)
  – 75% of an engineer’s time is spent debugging the design in specific areas
• VCS needs to be used in two modes:
  – Debugging mode, where extra visibility into the simulation is required
  – Regression mode, where performance is required

VCS Debugging Mode

• Useful for point problems
• Dumping signal state takes VCS resources
  – Slows down simulation, especially with lots of I/O to disk
  – Usually includes a debugger such as Debussy

VCS Regression Mode

• Optimize for speed!
• Usually many tests run in batch mode
• Just verifying pass/fail operation of the design
• Re-run only the failing tests in debugging mode, with signal-state saving turned on

Items That Kill VCS Performance

• VCS performance can be hampered by command-line switches used without discretion:
  – -I                 The interactive mode
  – -PP                Post-processing of the dumpfile
  – +cli               Turns on command-line interactive mode
  – +acc+2             Used for backwards-compatible PLI calls
  – Lack of -Mupdate   Loss of saved-state compile info
  – -P [library]       Compiles in libraries that may not be required

Items That Kill VCS Performance (cont)

• Coding styles can kill VCS performance
  – Delay loops or unneeded timing assignments kill performance, as VCS cannot optimize the execution:

      always @(posedge clk or negedge reset_n) begin
        if (~reset_n) q <= #0 0;
        else          q <= #1 d;
      end // always

  – #0 and #1 delays can increase simulation time by as much as 200%, with an average increase of 30-50%! [1]

[1] Verilog Nonblocking Assignments With Delays, Myths and Mysteries, Cliff Cummings, Boston SNUG 2002 paper.

Items That Kill VCS Performance (cont)

• Use of initial and always blocks
  – The following construct is legal Verilog but may cause VCS to lock up in an infinite loop, because the block has no event or delay control and simulation time never advances:

      always begin
        mysig = 1;
      end // always

  – Mixing blocking and nonblocking assignments may cause poor VCS simulation speed and bad logic if coded incorrectly. Not a good idea to mix these!

Items That Kill VCS Performance (cont)

• PLI and .tab files
• Disk & network I/O
  – Optimize your simulations by running on local disk
  – Network filesystems can be 10X slower!
  – Bribe your sysadmin for /tmp space, then run as follows:

      -o /tmp/siv -Mdir /tmp/csrc

Items That Improve VCS Performance

• Command-line switches
  – +rad   Verilog pre-processor that optimizes code
• Coding styles
  – Remove #0 and #1 delays
  – Separate sequential items from combinatorial processes
• In-line C instead of PLI calls
  – Use the Direct Kernel Interface (DKI)
  – Use the DKI for Debussy: +vcsd along with the proper vcsd .tab file
    http://www.solvnet.synopsys.com/retrieve/900611.html

Cleaning the pli.tab File

• PLI tab files can be optimized!
• Look for the following in your .tab files:

    acc:rw,cbka: *

  – The * signifies that every signal in your design gets read/write access and visibility by the debugger.
  – Streamline this, as you probably don’t need EVERY signal!
• Debussy ships with a very unoptimized pli.tab file:
  – Example line:

      $fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%*

  – Replace the %* at the end of that line with %TASK; this improves performance by as much as 15-20%
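Hunting for wildcard access entries by eye gets tedious in a large .tab file. The following is a minimal Python sketch, not a Synopsys or Debussy tool, that flags lines whose acc specification ends in a wildcard. The function name `find_wildcard_lines` and the regular expression are assumptions made for illustration; the pattern only covers the two spellings shown on this slide (`acc:rw,cbka: *` and `acc=read,callback_all:%*`).

```python
import re

# Hypothetical helper: matches an acc specification ending in a wildcard scope,
# covering only the two forms shown on the slide.
WILDCARD_ACC = re.compile(r"acc[=:][^\s:]*:\s*%?\*")

def find_wildcard_lines(tab_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs whose acc spec covers every signal."""
    hits = []
    for number, line in enumerate(tab_text.splitlines(), start=1):
        if WILDCARD_ACC.search(line):
            hits.append((number, line.strip()))
    return hits

sample = """\
$fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%*
$fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%TASK
acc:rw,cbka: *
"""

for number, line in find_wildcard_lines(sample):
    print(f"line {number}: wildcard access -> {line}")
```

Only the first and third sample lines are reported; the %TASK line is left alone.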

Cleaning the pli.tab File (cont)

• If your home-grown PLIs and debugging PLIs are still dragging down simulation, use the +vcs+pli+learn flag.
  – Run the simulation with this flag turned on
  – VCS figures out which PLI calls are used in the design
  – VCS generates a new pli.tab file to be used
• Useful for simulation speed improvements; results vary based on PLI usage
• Caveat emptor:
  – If you change your PLI interface or usage, you need to re-run with the flag or risk an incorrect/failing simulation!

Profiling Simulations

• Profile your simulation to see where the time is spent
• Useful to see whether code, PLI, or a library is causing the bottleneck
• Easy with VCS 5.2 and later:
  – Use +prof in the command-line compile script
  – VCS creates a vcs.prof output file
  – Read the file to see where the time is spent

Profiling Simulations (cont)

• Output of vcs.prof log:

    // Synopsys VCS 6.2R12
    // Simulation profile: vcs.prof
    // Simulation Time: 976.180 seconds

    ======================================================================
                                TOP LEVEL VIEW
    ======================================================================
    TYPE          %Totaltime
    ----------------------------------------------------------------------
    PLI               0.23
    VCD               0.99
    KERNEL            7.76
    DESIGN           91.02

  – The “Simulation Time” line is the total simulation time
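When many regressions are profiled, reading each vcs.prof by hand does not scale. Here is a small Python sketch that pulls the four TOP LEVEL VIEW rows out of a profile and reports the largest consumer. It assumes the file has one `TYPE value` row per line, as in the listing on this slide; the real file layout may differ, and `parse_top_level` is a name invented for this example.

```python
import re

# Assumed row layout: a category name followed by its %Totaltime value.
ROW = re.compile(r"\s*(PLI|VCD|KERNEL|DESIGN)\s+(\d+(?:\.\d+)?)\s*")

def parse_top_level(profile_text: str) -> dict[str, float]:
    """Extract the TYPE / %Totaltime rows of the TOP LEVEL VIEW."""
    shares = {}
    for line in profile_text.splitlines():
        match = ROW.fullmatch(line)
        if match:
            shares[match.group(1)] = float(match.group(2))
    return shares

# Values copied from the profile shown on this slide.
sample = """\
// Simulation Time: 976.180 seconds
TYPE %Totaltime
PLI 0.23
VCD 0.99
KERNEL 7.76
DESIGN 91.02
"""

shares = parse_top_level(sample)
biggest = max(shares, key=shares.get)
print(f"largest consumer: {biggest} at {shares[biggest]}% of total time")
```

For this run the DESIGN category dominates, which is the cue to look at the module view next.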

Profiling Simulations (cont)

    =====================================================================
                                 MODULE VIEW
    =====================================================================
    Module(index)         %Totaltime  No of Instances  Definition
    ---------------------------------------------------------------------
    delaychain (1)           67.75          56         ../top/rtl/delaychain.v:15.
    dll_delay_line (2)        7.16           2         ../rtl/ddrctlr/rtl/dll_delay_line.v:21.
    ckrst (3)                 2.47           1         ../top/rtl/ckrst.v:13.
    INVDL (4)                 1.25        8431         /projects/clibs/umc/0.15vst/tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
    hurricane_tb (5)          1.23           1         ../tb/hurricane_tb.v:31.
    pdisp (6)                 1.16           8         ../rtl/hurricane/rtl/pdisp.v:33.
    dll_mux (7)               0.96        1720         ../rtl/ddrctlr/rtl/dll_mux.v:21.
    dll_buf (8)               0.56        1732         ../rtl/ddrctlr/rtl/dll_buf.v:21.
    spsram_1536x32 (9)        0.54           8         /projects/clibs/rams_nobist_M4one/0.15vst_2.0/Verilog_fix/spsram_1536x32.v:8.

Profiling Simulations (cont)

• Ouch! Spending 67% of VCS time in delaychain.v
  – We used this to hand-tune the clock trees from the clock generation block (to get around anemic Apollo clock tree insertion issues).
  – The delaychain.v code looks like this:

      module delaychain (sigin, sigout);
        input          sigin;
        output [120:0] sigout;
        wire   [120:0] sigout;
        BUFD1 buf_u000 (.A(sigin), .Z(sigout[0]));
        INVDL inv_u001 (.A(sigin), .Z(sigout[1]));
        …
        INVDL inv_u120 (.A(sigout[119]), .Z(sigout[120]));
      endmodule

Profiling Simulations (cont)

• Compiling and running with +nospecify gives us better performance:

    // Synopsys VCS 6.2R12
    // Simulation profile: vcs.prof
    // Simulation Time: 149.660 seconds

    ======================================================================
                                TOP LEVEL VIEW
    ======================================================================
    TYPE          %Totaltime
    ----------------------------------------------------------------------
    PLI               1.65
    VCD               0.02
    KERNEL           11.86
    DESIGN           86.46

Profiling Simulations (cont)

• Better run, but still spending time in delaychain.v

    =======================================================================
                                  MODULE VIEW
    =======================================================================
    Module(index)           %Totaltime  No of Instances  Definition
    -----------------------------------------------------------------------
    delaychain (1)             17.24          56         ../top/rtl/delaychain.v:15.
    hurricane_tb (2)            8.43           1         ../tb/hurricane_tb_nodump.v:31.
    dll_mux (3)                 5.06        1720         ../rtl/ddrctlr/rtl/dll_mux.v:21.
    INVDL (4)                   4.12        8431         /projects/clibs/umc/0.15vst/tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
    pdisp (5)                   3.83           8         ../rtl/hurricane/rtl/pdisp.v:33.
    dll_buf (6)                 3.10        1732         ../rtl/ddrctlr/rtl/dll_buf.v:21.
    rctl (7)                    2.35           8         ../rtl/hurricane/rtl/rctl.v:32.
    dll_delay_element (8)       2.24        1720         ../rtl/ddrctlr/rtl/dll_delay_element.v:20.

Profiling Simulations (cont)

• Change the Verilog: `ifdef the instanced gates so they are compiled in only for synthesis and gate-level simulation

    module delaychain (
      sigin,
      sigout
    );
      input          sigin;
      output [120:0] sigout;
      wire   [120:0] sigout;
    `ifdef SYNTH_DELAYCHAIN
      // Conditionally compiled in
      BUFD1 buf_u000 (.A(sigin), .Z(sigout[0]));
      INVDL inv_u001 (.A(sigin), .Z(sigout[1]));
      …
      INVDL inv_u120 (.A(sigout[119]), .Z(sigout[120]));
    `else
      // RTL version, VCS can optimize this!
      assign sigout = { sigin, {60{!sigin, sigin}} };
    `endif
    endmodule
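It is worth convincing yourself that the one-line RTL assignment drives the same values as the gate chain. The Python sketch below models both versions for two-state inputs and compares them bit by bit. It rests on one assumption that the slide does not show: the instances hidden behind the ellipsis follow the pattern of the last one, so each inv_uNNN inverts the previous tap. The function names are made up for this example, and delays and x/z values are ignored.

```python
WIDTH = 121  # sigout[120:0]

def gate_chain(sigin: int) -> list[int]:
    """Values of sigout[0..120] for the instanced-gate version.

    Assumption: the elided instances drive sigout[n] from sigout[n-1],
    like inv_u120 does.
    """
    taps = [0] * WIDTH
    taps[0] = sigin        # BUFD1 buf_u000 buffers sigin
    taps[1] = 1 - sigin    # INVDL inv_u001 inverts sigin
    for n in range(2, WIDTH):
        taps[n] = 1 - taps[n - 1]
    return taps

def rtl_concat(sigin: int) -> list[int]:
    """Values of sigout[0..120] for { sigin, {60{!sigin, sigin}} }."""
    inv = 1 - sigin
    msb_first = [sigin] + [inv, sigin] * 60  # a concatenation lists the MSB first
    return msb_first[::-1]                   # reorder so index 0 is sigout[0]

for value in (0, 1):
    assert len(rtl_concat(value)) == WIDTH
    assert gate_chain(value) == rtl_concat(value)
print("RTL concatenation matches the gate chain for sigin = 0 and 1")
```

The width check also confirms that 1 + 60 × 2 bits fills the 121-bit bus exactly.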

Profiling Simulations (cont)

• Compiling and running with +nospecify and the new `ifdef’ed delaychain.v provides much better performance:

    // Synopsys VCS 6.2R12
    // Simulation profile: vcs.prof
    // Simulation Time: 124.410 seconds

    ======================================================================
                                TOP LEVEL VIEW
    ======================================================================
    TYPE          %Totaltime
    ----------------------------------------------------------------------
    PLI               1.15
    VCD               0.02
    KERNEL           13.34
    DESIGN           85.50
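To put the three profiled runs side by side, the sketch below divides the first run's simulation time by each later one. The times are copied from the vcs.prof headers in this deck; the labels are shorthand chosen for this example. Note that the module views also show a different testbench file (hurricane_tb_nodump.v) in the later runs, so the runs may differ in more than the one switch named on each slide.

```python
# Simulation times in seconds, copied from the three vcs.prof headers.
runs = [
    ("first profile", 976.180),
    ("+nospecify", 149.660),
    ("+nospecify and ifdef'ed delaychain.v", 124.410),
]

baseline = runs[0][1]
speedups = {label: baseline / seconds for label, seconds in runs}

for label, factor in speedups.items():
    print(f"{label}: {factor:.2f}x the speed of the first profile")
```

The second run comes out at roughly 6.5 times the speed of the first, and the third at roughly 7.8 times.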

Profiling Simulations (cont)

• delaychain.v is no longer at the top of the list of CPU hogs:

    =======================================================================
                                  MODULE VIEW
    =======================================================================
    Module(index)           %Totaltime  No of Instances  Definition
    -----------------------------------------------------------------------
    hurricane_tb (1)            6.79           1         ../tb/hurricane_tb_nodump.v:31.
    dll_mux (2)                 5.90        1720         ../rtl/ddrctlr/rtl/dll_mux.v:21.
    pdisp (3)                   4.92           8         ../rtl/hurricane/rtl/pdisp.v:33.
    rctl (4)                    3.76           8         ../rtl/hurricane/rtl/rctl.v:32.
    dll_buf (5)                 3.66        1732         ../rtl/ddrctlr/rtl/dll_buf.v:21.
    xaux_regs (6)               2.93           8         ../rtl/hurricane/rtl/xaux_regs.v:249.
    dll_delay_element (7)       2.63        1720         ../rtl/ddrctlr/rtl/dll_delay_element.v:20.
    delaychain (8)              2.08          56         ../top/rtl/delaychain.v:15.
    tdc_cdb (9)                 1.67           1         ../rtl/tdc/rtl/tdc_cdb.v:16.
    spsram_1536x32 (10)         1.32           8         /projects/clibs/rams_nobist_M4one/0.15vst_2.0/verilog_fix/spsram_1536x32.v:8.

Speeding Gate Simulations

• Gate-level, back-annotated simulations are really slow, but useful
  – Check for real-world conditions that STA may have missed
  – Useful for boot/power-up testing (does your chip come out of reset?)
  – Ensure that the layout netlist works as specified
• Good place for simulation speed improvements!
• VCS has two switches that optimize gate-level simulation:
  – +timopt
  – +memopt

Speeding Gate Simulations (cont)

• The +timopt flag:
  – Optimization based on clock signals and sequential devices in the design
  – Useful since +rad can’t optimize SDF-annotated designs
  – Used as +timopt+time, where time is the smallest clock period in the design
  – VCS generates a configuration file that shows further optimizations that can be done by hand
  – In one Corrent design, 32% of the design was optimized using +timopt
  – Speed improvement varies; our test case measured a 20% improvement

Speeding Gate Simulations (cont)

• +memopt can compress memory structures during compile
  – Useful if gate sims don’t fit into the memory footprint (e.g. the Linux ~3GB process-size limitation)
  – May not always work: the process can still overflow the process-size limit
  – Use +memopt+2 to spawn a second child process for compilation
• On Linux, bump the process-size limit from the generic 3GB to 3.7GB:
  – Edit /usr/src/linux-2.4/include/asm-i386/page.h
  – Change 0xC0000000 to 0xEC000000
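The 3GB and 3.7GB figures follow directly from the two hexadecimal constants. This Python sketch does the arithmetic, assuming the constant being edited is the address where kernel space begins on 32-bit x86, so that everything below it is available to a user process. The variable names are illustrative only.

```python
GIB = 1 << 30  # bytes per GiB

# Assumed meaning: start of kernel address space on 32-bit x86.
default_split = 0xC0000000
raised_split = 0xEC000000

print(f"default user space: {default_split / GIB} GiB")
print(f"raised user space:  {raised_split / GIB} GiB")
print(f"gain:               {(raised_split - default_split) // (1 << 20)} MiB")
```

The raised value works out to 3.6875 GiB, which the slide rounds to 3.7GB; the gain is 704 MiB of user address space.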

Performance Results

• Corrent increased RTL-based simulation performance by over 6X!
  – Average speed increase measured over 15 different simulation runs of a 15M gate RTL design
  – Incremental changes measured between flag settings
  – Profiling was essential!
• Corrent increased gate-level simulation speed by an average of 22% using +timopt
  – Some simulations were able to fit into 3.7GB of process size with +memopt

Performance Results (cont)

Test Name                       A        B        C        D        E
ahb_cfg_test1              30.770    7.340    7.360    6.650    5.940
ahb_cfg_test2              13.130    4.370    4.380    4.170    3.740
ahb_cfg_test_incr         186.830   33.860   33.780   29.530   25.740
ahb_cfg_wr_rd              67.610   13.980   13.940   12.400   10.980
ahb_ddr_bw                115.140   22.310   22.250   18.970   16.680
ahb_ddr_test1              61.570   13.000   12.940   11.260    9.890
ahb_ddr_test_incr         103.870   21.420   21.270   18.510   16.320
ahb_ddr_wr_rd             100.100   20.520   20.340   17.400   15.400
ahb_memctl_test1          217.370   34.560   34.330   29.610   25.220
ahb_memctl_test_sdram     131.330   23.970   23.900   21.170   18.420
gmi_mission_test1         416.290   76.550   76.510   67.790   59.320
gmi_mission_test2         416.880   76.840   76.940   67.890   59.560
gmi_mission_test3         416.340   76.850   76.630   67.670   59.470
gmi_pause_test1            76.250   15.730   15.700   13.920   12.560
gmi_ser_test               97.630   18.230   18.180   16.160   14.140
Speed Increase over (A)         -     533%     534%     608%     693%

(The last row is the column-A total divided by each column’s total, as a percentage, so 533% means about 5.3X as fast as A.)

A: Baseline script
B: Removal of instanced gate-level delay chains
C: Remove +acc+2, -I, and -PP switches
D: Remove compiled-in Debussy PLI and other debugging PLIs
E: With +nospecify switch
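The summary rows can be checked from the per-test numbers. The Python sketch below sums each column and divides the baseline total by each column's total. This is an assumption about how the published percentages were derived, not a method stated in the deck: the ratio of totals agrees with the published figures once truncated to whole percent, whereas averaging the per-test ratios does not. The helper names are invented for this example.

```python
# Run times copied from the results table, columns A..E.
TIMES = {
    "ahb_cfg_test1":         (30.770, 7.340, 7.360, 6.650, 5.940),
    "ahb_cfg_test2":         (13.130, 4.370, 4.380, 4.170, 3.740),
    "ahb_cfg_test_incr":     (186.830, 33.860, 33.780, 29.530, 25.740),
    "ahb_cfg_wr_rd":         (67.610, 13.980, 13.940, 12.400, 10.980),
    "ahb_ddr_bw":            (115.140, 22.310, 22.250, 18.970, 16.680),
    "ahb_ddr_test1":         (61.570, 13.000, 12.940, 11.260, 9.890),
    "ahb_ddr_test_incr":     (103.870, 21.420, 21.270, 18.510, 16.320),
    "ahb_ddr_wr_rd":         (100.100, 20.520, 20.340, 17.400, 15.400),
    "ahb_memctl_test1":      (217.370, 34.560, 34.330, 29.610, 25.220),
    "ahb_memctl_test_sdram": (131.330, 23.970, 23.900, 21.170, 18.420),
    "gmi_mission_test1":     (416.290, 76.550, 76.510, 67.790, 59.320),
    "gmi_mission_test2":     (416.880, 76.840, 76.940, 67.890, 59.560),
    "gmi_mission_test3":     (416.340, 76.850, 76.630, 67.670, 59.470),
    "gmi_pause_test1":       (76.250, 15.730, 15.700, 13.920, 12.560),
    "gmi_ser_test":          (97.630, 18.230, 18.180, 16.160, 14.140),
}
COLUMNS = "ABCDE"

def column_totals(times):
    """Sum the run times of every test, per column."""
    return {col: sum(row[i] for row in times.values())
            for i, col in enumerate(COLUMNS)}

def ratio_percent(totals, baseline):
    """Baseline total divided by each column's total, as a percentage."""
    return {col: 100.0 * totals[baseline] / total
            for col, total in totals.items()}

totals = column_totals(TIMES)
over_a = ratio_percent(totals, "A")
over_b = ratio_percent(totals, "B")

for col in "BCDE":
    print(f"{col}: {over_a[col]:.1f}% of A's speed, "
          f"{over_b[col] - 100:.1f}% faster than B")
```

Under this reading, the "over (A)" row is a speed ratio while the "over (B)" row in the switches-only table is a percentage gain, which is why the two rows look so different in scale.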

Performance Results (cont)

• Command-line switches alone:

Test Name                       B        C        D        E
ahb_cfg_test1               7.340    7.360    6.650    5.940
ahb_cfg_test2               4.370    4.380    4.170    3.740
ahb_cfg_test_incr          33.860   33.780   29.530   25.740
ahb_cfg_wr_rd              13.980   13.940   12.400   10.980
ahb_ddr_bw                 22.310   22.250   18.970   16.680
ahb_ddr_test1              13.000   12.940   11.260    9.890
ahb_ddr_test_incr          21.420   21.270   18.510   16.320
ahb_ddr_wr_rd              20.520   20.340   17.400   15.400
ahb_memctl_test1           34.560   34.330   29.610   25.220
ahb_memctl_test_sdram      23.970   23.900   21.170   18.420
gmi_mission_test1          76.550   76.510   67.790   59.320
gmi_mission_test2          76.840   76.940   67.890   59.560
gmi_mission_test3          76.850   76.630   67.670   59.470
gmi_pause_test1            15.730   15.700   13.920   12.560
gmi_ser_test               18.230   18.180   16.160   14.140
Speed Increase over (B)         -       0%      14%      30%

A: Baseline script
B: Removal of instanced gate-level delay chains
C: Remove +acc+2, -I, and -PP switches
D: Remove compiled-in Debussy PLI and other debugging PLIs
E: With +nospecify switch

Summary

• Clean the simulation scripts by removing the following:
    -I
    +acc+2
    -PLI [library] (unused PLI calls)
    -PP
• Add these flags to the simulation scripts:
    -Mupdate -o csrc (use local disk)
    +rad
    +nospecify
    +nbaopt (if required)

Summary (cont)

• Clean your Verilog of #0 and #1 delays
• Optimize your pli.tab files
• Profile your simulation! Nasty time-sinks can be resolved!

References

• VCS 5.0 and 6.0 User Guides, Synopsys Corporation, 2002.
• Test Benches: The Dark Side of IP Reuse, Gregg D. Lahti, San Jose SNUG 2000 paper. http://gateslinger.com/chiphead.htm or http://www.synopsys.com/news/pubs/snug/snug00/lahti_final.pdf
• Verilog Nonblocking Assignments With Delays, Myths and Mysteries, Cliff Cummings, Boston SNUG 2002 paper. http://www.sunburst-design.com/papers/
• ESNUG posts: 380 item 11, 383 item 9, 387 item 16. http://deepchip.com/esnug.html
• Solvnet: http://solvnet.synopsys.com

Special thanks to Mark Warren for the fruit basket and review of the paper!

