Verification and debugging of hardware designs utilizing C-based high-level design descriptions
Masahiro Fujita
VLSI Design and Education Center (VDEC)
University of Tokyo
System level design flow
[Figure: system-level design flow as a meeting point between ideas and components. Conception, functional decomposition (function-separated description), architectural exploration with performance analysis and HW/SW partitioning, structural decomposition using component libraries, high-level synthesis, RTL-to-RTL optimization, and RTL synthesis. Abstraction levels: C/TLM (high level / transactional), C/HL, RTL. Steps are automatic or manual.]
• Typical design flow used in industry
• Mostly C language based design descriptions for embedded system and SoC designs
(Electric) System level design tools: Elegant
[Figure: system-level design stages that precede the point where traditional design starts: algorithm design (functions A to D), HW/SW partitioning, communication network design (protocols), interface design, and hardware (RTL) and software implementation design, yielding interface circuits, a processor model, and device drivers. Descriptions are written in SystemC/SpecC.]
• Joint development by JAXA, Toshiba, NEC, Fujitsu, UC Irvine, and U. of Tokyo (formal verification)
• Helps assign functions onto components
• Static/model checking, equivalence checking
• JAXA: the Japanese space exploration agency. The Elegant tool has actually been used for space satellites.
Outline
• Overview of our verification research activities on C-based high-level design descriptions
  – Based on the Extended System Dependence Graph (ExSDG)
• Equivalence checking as well as static/model checking with ExSDG
• Post-silicon verification method through mapping chip traces with ExSDG
• Verification/debugging methods for large arithmetic circuits
• Automatic generation of on-chip bus protocol transducers
System Dependence Graph
  int i = 1;
  main() {
    int a = 0;
    int b = 0;
    a = a + i;
    if (a == 1)
      b = a++;
  }
[Figure: dependence analysis of the program above yields the System Dependence Graph (SDG). Nodes: main, "i = 1;", "a = 0; b = 0;", "a = a + i;", "if(a==1)", "b = a++;", "a++", and the declarations int i, int a, int b. Edge types: control dependence, data dependence, declaration dependence.]
• Sufficient representation, since verification methods for netlists are applicable to SDG
• Problems
  – Existing SDGs are too complicated (the numbers of nodes and edges are huge)
  – They do not correspond directly to abstract syntax trees
Extended Syntaxes in ExSDG from C
• Hierarchical structure: a module contains sub-modules (Sub Module 1, Sub Module 2) connected through ports
• Concurrency:
    par{ a.main(); b.main(); c.main(); }
• Synchronization: one process runs "... wait(a); ... notify(b); ..." while another runs "... notify(a); ... wait(b); ..."
• Bit vector:
    bit a[3:0];
    bit b[3];
    a[2:0]@b[1]
• Timed behavior (waveform of a against clk):
    buffered bit[1] a;
    a = 1; waitfor(1); waitfor(1); a = 0; waitfor(1);
Translation Flow to ExSDG
[Figure: SpecC, SystemC, SystemVerilog, Verilog, and VHDL designs, at system/behavior level or RTL, are parsed into ASTs (UBL, TBL, or RTL), simplified, and translated to ExSDG.]
• Untimed Behavior Level (UBL): no notion of time; for software design or behavior specification
• Timed Behavior Level (TBL): includes timing specification; for design with timing estimation
• Register Transfer Level (RTL): cycle accurate; for hardware design
• Simplification: complex or redundant syntaxes are removed, e.g. the switch statement
Three Design Stages
Untimed Behavior Level:
・ Concurrency and synchronization
・ Timed expressions and buffered/wire variables are prohibited
  module TEST(int in1, int out1, event e) {
    Proc p1();
    Proc p2();
    void main() {
      par { p1.main(); p2.main(); }
      wait(e);
    }
  };
Timed Behavior Level:
・ Untimed Behavior Level + timed expressions
・ buffered/wire variables are prohibited
  module TEST(int in1, int out1, event e) {
    Proc p1();
    Proc p2();
    void main() {
      par { p1.main(); p2.main(); }
      waitfor(1);
      wait(e);
    }
  };
Register Transfer Level:
・ par and wait/notify are prohibited
・ init(): wire connections are declared
・ run_one_cycle(): executed once per clock
  module TEST(wire int in1, wire int out1) {
    buffered int a;
    void init() { out1 = a; }
    void run_one_cycle() { a = in1; }
    main() {}
  };
Edges in ExSDG
• Control Flow Edge
• Data Dependence Edge
• Control Dependence Edge
• Declaration Dependence Edge
• Parameter In/Out
• Summary Edge (Interprocedural Dependence Edge)
• Parallel Edge
• Communication Edge
• Port Reference Edge
Concurrent and Synchronization Dependence
  int x, y;
  event e1, e2;
  void main(void) {
    y = 0;
    x = 5;
    par { func1(); func2(); }
  }
  void func1() {        /* Process 1 */
    y = x;
    notify(e1);
    wait(e2);
    x = y;
  }
  void func2() {        /* Process 2 */
    wait(e1);
    y++;
    y = y * y;
    notify(e2);
  }
[Figure: parallel edges connect the par node to func1() and func2() and to the end node; communication edges connect notify(e1) to wait(e1) and notify(e2) to wait(e2).]
Hierarchical Dependence
  module M1(int in, int out, event e1) {
    void main(void) { wait(e1); out = in * in; }
  };
  module M2(int in, int out, event e1) {
    void main(void) { out = in + in; notify(e1); }
  };
  module M3(int inout) {
    int w1;
    event e1;
    M1 m1(w1, inout, e1);
    M2 m2(inout, w1, e1);
    void main(void) {
      par { m1.main(); m2.main(); }
    }
  };
[Figure: port reference edges connect the declarations of M3 (int M3::w1; event M3::e1; int M3::inout;) to the ports of M1 (int M1::in; int M1::out; event M1::e1;) and of M2 (int M2::in; int M2::out; event M2::e1;).]
Experimental Results
• Compared with a SpecC SDG generation method using CodeSurfer
• Performed on a PC with a 3.2 GHz Xeon and 2 GB of memory
Number of nodes and edges in the IDCT example
  Representation | # of nodes | # of edges
  ExSDG | 453 | 2061
  SDG generated by CodeSurfer | 6380 | 48073
SDG generation time
  Example | NoL | CodeSurfer | ExSDG
  IDCT | 420 | 8.7s | 5.2s
  Elevator | 3055 | 9.1s | 222.2s
  MPEG4 | 5657 | 30.4s | 902.9s
From the viewpoint of verification
[Figure: refinement flow from an algorithmic design description, through many steps of manual refinement and design optimization, to a high-level synthesizable description, and then through high-level synthesis and design optimization to a register transfer level description. Static/model checking applies to each description; sequential equivalence checking applies between descriptions.]
• Keep all the descriptions as correct as possible
• Apply static/model checking to each description
• Apply equivalence checking between two descriptions
• Besides simulation, formal methods should be applied
Basic Procedure
• Symbolic simulation
  – Generates a set of equations from designs
  – Every variable/operation is an uninterpreted symbol
  – Every expression is a formula of symbols
• Equivalence Class (EqvClass)
  – A class containing all equivalent expressions/variables
  – Generated during symbolic simulation based on
    • Assignment statements
    • Substitution with equivalent expressions/variables
• Solving with SMT solvers
  – If the generated equations have to be solved to prove the desired equivalence, we use SMT solvers to interpret the arithmetic
  – Public solvers and our own solvers
Example
Description 1:
  a = v1;
  b = v2;
  add1 = a + b;
Description 2:
  add2 = v1 + v2;
4 EqvClasses are generated from the 4 assignments:
  E1 (a, v1)
  E2 (b, v2)
  E3 (add1, a+b)
  E4 (add2, v1+v2)
E3 and E4 can be merged by substituting a with v1 and b with v2:
  E1 (a, v1)
  E2 (b, v2)
  E3' (add1, a+b, add2, v1+v2)
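The merging step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the FLEC implementation: the class name EqvClasses and its methods are hypothetical, operators are treated as uninterpreted symbols (so a+b and b+a are not merged), and every variable is assumed to be assigned once.

```python
class EqvClasses:
    """Toy equivalence classes over uninterpreted terms (hypothetical helper)."""

    def __init__(self):
        self.parent = {}  # term -> parent term; a root is its own parent

    def find(self, term):
        # Return the representative of the class containing `term`.
        self.parent.setdefault(term, term)
        while self.parent[term] != term:
            term = self.parent[term]
        return term

    def canon(self, term):
        # Substitute every operand by its representative, bottom-up.
        if isinstance(term, tuple):
            op, *args = term
            term = (op, *(self.canon(a) for a in args))
        return self.find(term)

    def assign(self, lhs, rhs):
        # An assignment puts lhs into the class of the (substituted) rhs.
        rep = self.canon(rhs)
        self.parent[self.find(lhs)] = rep

    def equivalent(self, x, y):
        return self.canon(x) == self.canon(y)


ec = EqvClasses()
ec.assign("a", "v1")                 # E1 (a, v1)
ec.assign("b", "v2")                 # E2 (b, v2)
ec.assign("add1", ("+", "a", "b"))   # E3 (add1, a+b)
ec.assign("add2", ("+", "v1", "v2")) # E4 (add2, v1+v2)
print(ec.equivalent("add1", "add2"))
```

Because a+b is rewritten to v1+v2 before it is stored, add1 and add2 end up in the same class, which corresponds to E3' above.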
Representing Symbolic Expressions: Maximally Shared Graph
  x0 = a + b;
  if x0 < 0 then
    x1 = -x0; y1 = y0;
  else
    x1 = x0; y1 = y0 + x0;
  assert(0 <= y1);
[Figure: maximally shared graph for the program above. x0 = a + b feeds a comparison with 0 (cond); two ITE nodes select x1 and y1 from their true and false branches; the verification condition (VC) is the node 0 <= y1.]
• Linear-sized representation
  – Mathematically equivalent to the standard logical representation
• Advantages
  – Structure is explicit (flow of data in the graph corresponds to flow of data in the program)
  – Simple slicing
• No structural redundancies
• Not functionally canonical
  – Practical trade-off
Problem and Our Approach
• Symbolic simulation cannot be applied to a whole large design
  – Because of the path explosion problem
  – Approach: use the differences between the descriptions to localize the areas that are symbolically simulated
[Figure: two descriptions, both ending in "return a;", with the differing regions marked.]
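Equivalence Checking Flow
[Flowchart; inputs are the sequential descriptions desc1 and desc2.]
1. Identify the differences between the two descriptions.
2. Is there any difference left? If no, report "Equivalent". If yes, decide the initial verification area and its inputs/outputs.
3. Verify the area. If the result is Eqv, return to step 2.
4. If the result is Not eqv, check whether the area has reached the primary inputs/outputs. If yes, report "Not equivalent". If no, extend the area, decide its inputs/outputs, and repeat step 3.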
Verification Area and its Input/Output
• Verification area: a set of statements
• Input variables
  – Used in the area, and assigned outside
• Output variables
  – Assigned in the area, and used outside
• Verification problem for the area
  – Are all pairs of output variables with the same name equivalent?
  – Using proved equivalences of the input variables
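As an illustration of these definitions, the following Python sketch computes the input and output variables of a verification area. It assumes straight-line code in which each variable is assigned once; the function name area_io and the statement encoding (assigned variable, list of used variables) are hypothetical, and primary outputs are treated as "used outside".

```python
def area_io(stmts, area, primary_outputs):
    """Return (inputs, outputs) of the statements whose indices are in `area`.

    stmts: list of (assigned_variable, [used_variables]) in program order.
    """
    assigned_in = {stmts[i][0] for i in area}
    used_in = {v for i in area for v in stmts[i][1]}
    used_out = {v for i, (_, rhs) in enumerate(stmts)
                if i not in area for v in rhs}
    inputs = used_in - assigned_in                      # used inside, assigned outside
    outputs = assigned_in & (used_out | set(primary_outputs))
    return inputs, outputs


# a0 = in1; b0 = a0 + in2; c0 = 0; a1 = b0 + in3; out0 = a1 * c0;
stmts = [
    ("a0", ["in1"]),
    ("b0", ["a0", "in2"]),
    ("c0", []),
    ("a1", ["b0", "in3"]),
    ("out0", ["a1", "c0"]),
]
print(area_io(stmts, {3}, ["out0"]))     # area = the assignment to a1
print(area_io(stmts, {3, 4}, ["out0"]))  # area after forward extension
```

For the single statement "a1 = b0 + in3" the inputs are b0 and in3 and the output is a1; after a forward extension to the assignment of out0, the inputs become b0, c0, and in3 and the output becomes out0.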
Example
Description 1:
  a0 = in1;
  b0 = a0 + in2;
  c0 = 0;
  a1 = b0 + in3;
  out0 = a1 * c0;
Description 2:
  a0 = in1;
  b0 = a0 + in2;
  c0 = 0;
  a1 = b0 + 2 * in3;      (Diff)
  out0 = a1 * c0;
(※) in1, in2, in3 are the primary inputs of both descriptions
• Initial verification
  – Input … b0, in3
  – Output … a1
  – Result … Equivalence cannot be proved
  – Next step: forward extension from a1
• 2nd verification
  – Input … b0, c0, in3
  – Output … out0
  – Result … Equivalence cannot be proved
  – Next step: backward extensions from all inputs (forward extension cannot be applied any more)
• 3rd verification
  – Input … a0, in2, c0 (= 0), in3
  – Output … out0
  – Result … Equivalent (due to c0 = 0)
Though the differing parts are not equivalent, the primary outputs are equivalent.
Implementation
• FLEC: A Framework for Verification, Debugging Support, and Static Checking
  – Several engines developed by ourselves
    • Symbolic simulator, difference extraction, input pattern generation, static deadlock checker, slicing, …
  – Dependence graph-based internal representation of system-level designs
    • A number of APIs are provided to help the development of engines
• ExSDG (Extended System Dependence Graph)
  – Designs are represented in the form of ExSDG in FLEC
  – A frontend from SpecC to ExSDG has already been developed
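FLEC Structure
[Figure: SpecC Design1 (.sc) and SpecC Design2 (.sc) are each translated by ExSDG generation into ExSDG (.fls) files. A verification procedure (applied methods and their order) controls the engines: symbolic simulator, sequentialization of parallel behaviors, difference extraction, rule-based engine, and SMT solvers. The engines share the equivalence classes and the equivalence specification, and produce the result of the equivalence check.]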
Sequential equivalence checking
• Definition of equivalence
  – Intention of designers, management of differences
• For efficient checking:
  – Identification of matching states
  – Bounded equivalence checking between matching states
  – Identification of equivalent internal points
• Use of SMT solvers
[Figure: spec and impl are driven by aligned inputs, resets, and clocks, and their aligned outputs are compared. Verification returns either "falsified" with a counterexample or "proven" (bounded or full proof).]
Transactions: State View
• A transaction encapsulates one or more units of computation for the design being verified
• It is self-contained, since it brings the machine back to a synchronizing state
[Figure: refinement mapping between an SL transaction in the SLM and the corresponding RTL transaction.]
Transaction: Memory
• Design 1 transaction: a single memory read/write occurring in a single cycle
• Design 2 transaction: a single memory read/write (potentially) happening over multiple cycles
[Figure: Design 1 is a memory with ADDR, DATA, RD, and WR inputs and an OUT output. Design 2 has the same ports but places a cache and cache controller in front of the memory.]
Transaction-Based Equivalence Checking
• The states in RTL that correspond to states in the system-level model (SLM) are referred to as synchronizing states
• Based on this, definitions of equivalence are generated manually
• The total equivalence is based on induction over synchronizing states
[Figure: refinement mapping between SLM states and RTL synchronizing states, with transient RTL states in between. Outcome: complete verification or a sequential counterexample.]
Equivalence Specification
• Equivalence is specified by tuples (Port, Throughput, Latency, Condition)
  behavior Adder1(in int in1, in int in2, in int in3, out int out1) {
    void main() { out1 = in1 + in2 + in3; }
  };
  Specification: (in1, 1, 0, TRUE) (in2, 1, 0, TRUE) (in3, 1, 0, TRUE) (out1, 1, 0, TRUE)
  behavior Adder2(in int in1, in int in2, in int in3, out int out1) {
    void main() {
      int tmp;
      while (1) {
        tmp = in1 + in2; waitfor(5);
        tmp = tmp + in3; waitfor(5);
        out1 = tmp;
      }
    }
  };
  Specification: (in1, 10, 0, TRUE) (in2, 10, 0, TRUE) (in3, 10, 5, TRUE) (out1, 10, 10, TRUE)
  behavior Adder3(in int in1, out int out1) {
    void main() {
      int tmp;
      while (1) {
        tmp = in1; waitfor(1);
        tmp = tmp + in1; waitfor(1);
        tmp = tmp + in1; waitfor(1);
        out1 = tmp; waitfor(1);
      }
    }
  };
  Specification: (in1, 4, 0, TRUE) (in1, 4, 1, TRUE) (in1, 4, 2, TRUE) (out1, 4, 3, TRUE)
Sequentialization
• Concurrent behaviors are sequentialized
• If st1 and st2 run concurrently and are in a "write-write" or "read-write" relation, check the following properties:
  – P1: always T(st1) > T(st2)
  – P2: always T(st1) < T(st2)
  – T(s) … execution time of a statement s
• The checks are based on ILP
  – Is there any assignment satisfying P1 (P2)?
  – With the timing constraints generated from SpecC designs
• Interpretation of the results
  – (P1, P2) = (pass, pass): impossible
  – (P1, P2) = (fail, pass): always st1 → st2 (can be sequentialized)
  – (P1, P2) = (pass, fail): always st2 → st1 (can be sequentialized)
  – (P1, P2) = (fail, fail): order is undecidable
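The decision table can be illustrated with a small Python sketch. Note the substitution: the slides check P1 and P2 with ILP over timing constraints generated from the SpecC design, whereas this sketch assumes each statement simply has a known interval [earliest, latest] of possible execution times and compares the intervals. The function name classify_order is hypothetical.

```python
def classify_order(t1, t2):
    """Classify the order of st1 and st2 from their execution-time intervals.

    t1, t2: (earliest, latest) execution times of st1 and st2 (an assumption;
    the original method derives these checks from ILP constraints).
    """
    (lo1, hi1), (lo2, hi2) = t1, t2
    p1 = lo1 > hi2  # P1: always T(st1) > T(st2)
    p2 = hi1 < lo2  # P2: always T(st1) < T(st2)
    if p1 and p2:
        raise ValueError("(pass, pass) is impossible for non-empty intervals")
    if p2:
        return "st1 -> st2"   # (fail, pass): can be sequentialized
    if p1:
        return "st2 -> st1"   # (pass, fail): can be sequentialized
    return "undecidable"      # (fail, fail)


print(classify_order((0, 3), (5, 9)))
print(classify_order((5, 9), (0, 3)))
print(classify_order((0, 6), (5, 9)))
```

Overlapping intervals give the "undecidable" case, in which the two statements cannot be placed in a fixed sequential order.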
Sequentialization
• No dependence: no check is needed
    Process 1: a = 10; b = 10; c = a + b;
    Process 2: x = 20; y = 20; z = x + y;
    Sequentialized: a = 10; b = 10; c = a + b; x = 20; y = 20; z = x + y;
• Synchronized
    Process 1: a = 10; wait e; c = a + x;
    Process 2: x = 20; notify e; y = 20; z = x + y;
    Check: always "x = 20" → "c = a + x"? Result: YES
    Check: always "c = a + x" → "x = 20"? Result: NO
    Can be sequentialized: a = 10; x = 20; y = 20; z = x + y; c = a + x;
• Not synchronized
    Process 1: x = 10; a = 10; c = a + x;
    Process 2: x = 20; y = 20; z = x + y;
    Check: always "x = 10" → "x = 20"? Result: NO
    Check: always "x = 20" → "x = 10"? Result: NO
    Cannot be sequentialized
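For HW/SW co-design
[Figure: the software part (C) and the hardware part (RTL) are each transformed to FSMD, giving SW (FSMD) and HW (FSMD). After abstraction of the communication and identification of synchronization points, sequentialization and reduction of the number of states yield a combined HW+SW (FSMD), to which model checking and equivalence checking are applied.]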
Case Study 1: MPEG4
• Difference between designs
  – Constant propagation, constant folding, common sub-expression elimination in the IDCT function
• Design size
  – About 6300 lines in SpecC
  – About 50k nodes and 36k edges in ExSDG
• ExSDG generation time: 780 sec
  Designs compared | Nodes in diff | Result | Run time | # of ext.
  MPEG4_org, MPEG4_rev1 | 96 | Eqv | 3.3 sec | 0
  MPEG4_org, MPEG4_rev2 | 96 | Not eqv | 13.2 sec | 80
Case Study 2: Elevator Controller
• Difference between designs
  – Speculative code motion in control paths
• Design size
  – About 3300 lines in SpecC
  – About 20k nodes and 20k edges in ExSDG
• ExSDG generation time: 178 sec
  Designs compared | Nodes in diff | Result | Run time | # of ext.
  Elv_org, Elv_rev1 | 4 | Eqv | 1.8 sec | 1
  Elv_org, Elv_rev2 | 3 | ---- | > 12 hours | 4
Rule-based Equivalence Checking
Checks the equivalence between two high-level (e.g. SpecC) design descriptions
• Assuming the equivalences of variables, equivalence rules are applied in a bottom-up manner
  – Equivalence rules are defined in terms of static dependence relations and control flows
• The verification result is either "equivalent" or "cannot prove the equivalence"
  – It cannot prove that the descriptions are not equivalent
  – Equivalence rules are chosen heuristically
Equivalence Rules (1/3)
Rule 1: Expression
• Checks the equivalence considering the commutative, associative, and distributive laws
  Original design:  d * a * b + c * d
  Modified design:  d * (b * a + c)
[Figure: expression trees of the two designs. b * a matches a * b by the commutative law; d * (... + c) matches d * ... + c * d by the distributive law.]
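One way to picture Rule 1 is to expand both expressions into a canonical sum of monomials and compare the results. The Python sketch below does this for expressions built from variables, "+", and "*". It is an illustration only, not the rule engine of the slides (which matches trees bottom-up); the function name normalize and the tuple encoding of expressions are hypothetical.

```python
def normalize(expr):
    """Expand an expression tree into {monomial: coefficient}.

    expr is a variable name (str) or a tuple (op, left, right) with op in "+*".
    A monomial is a sorted tuple of variable names, so a*b and b*a coincide.
    """
    if isinstance(expr, str):
        return {(expr,): 1}
    op, left, right = expr
    pl, pr = normalize(left), normalize(right)
    out = {}
    if op == "+":
        for poly in (pl, pr):
            for mono, coeff in poly.items():
                out[mono] = out.get(mono, 0) + coeff
    elif op == "*":
        # Distribute every monomial of the left over every monomial of the right.
        for ml, cl in pl.items():
            for mr, cr in pr.items():
                mono = tuple(sorted(ml + mr))
                out[mono] = out.get(mono, 0) + cl * cr
    else:
        raise ValueError("unsupported operator: " + op)
    return {m: c for m, c in out.items() if c != 0}


original = ("+", ("*", ("*", "d", "a"), "b"), ("*", "c", "d"))  # d*a*b + c*d
modified = ("*", "d", ("+", ("*", "b", "a"), "c"))              # d*(b*a + c)
print(normalize(original) == normalize(modified))
```

Both designs expand to the monomials a·b·d and c·d, so the comparison succeeds.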
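Equivalence Rules (2/3)
Rule 2: Assignment
• The variable on the LHS is equivalent to the RHS until the variable is re-assigned
  Original design:
    { return a - b; }
  Introducing an intermediate variable:
    { int c = a - b; return c; }
[Figure: expression trees of the two designs. The subtrees a - b are matched by Rule 1; the assignment c = a - b followed by "return c" is matched with "return a - b" by Rule 2.]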
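Equivalence Rules (3/3)
Rule 3: Sequential composition
• The execution order can be changed unless the change destroys the data dependence relations
  Original design:
    { L1: c = a + b;  L2: d = a + c;  L3: e = b + c; }
  Swapping L1 and L2 (L2 reads c, which L1 writes, so the data dependence is destroyed):
    { L2: d = a + c;  L1: c = a + b;  L3: e = b + c; }
  Swapping L2 and L3 (no data dependence between them, so the order can be changed):
    { L1: c = a + b;  L3: e = b + c;  L2: d = a + c; }
[Figure: seq nodes with children (L1, L2, L3), (L2, L1, L3), and (L1, L3, L2).]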
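The condition of Rule 3 for two adjacent assignments can be sketched as a check for read-after-write, write-after-read, and write-after-write conflicts. This is a minimal illustration; the function name can_swap and the encoding of a statement as (written variable, set of read variables) are hypothetical.

```python
def can_swap(s1, s2):
    """True if two adjacent assignments can be reordered without
    changing any data dependence between them."""
    w1, r1 = s1
    w2, r2 = s2
    raw = w1 in r2   # s2 reads what s1 writes
    war = w2 in r1   # s2 overwrites what s1 reads
    waw = w1 == w2   # both write the same variable
    return not (raw or war or waw)


L1 = ("c", {"a", "b"})  # L1: c = a + b;
L2 = ("d", {"a", "c"})  # L2: d = a + c;
L3 = ("e", {"b", "c"})  # L3: e = b + c;
print(can_swap(L1, L2))  # L2 depends on c written by L1
print(can_swap(L2, L3))  # independent statements
```

For the example above, swapping L1 and L2 is rejected while swapping L2 and L3 is accepted.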
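Example: Bottom-Up Application of Rules
  Original design:
    { c = a - b;  f = d + e; }
  Modified design:
    { f = e + d;  c = a - b; }
[Figure: trees of the two designs, each a seq node over the assignments to c and f. Starting from the internal equivalences of the variables, Rule 1 matches the expressions (a - b with a - b, d + e with e + d), Rule 2 matches the assignments to c and f, and Rule 3 matches the two seq nodes.]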
A Known Issue of Rule-based Checking
How can we find internal equivalences?
• Our initial method finds them by "name"
  – That is, all variables with the same name are assumed to be equivalent
• This approach fails the equivalence checking when variable names change
  – Typically, variable names are changed through design transformations
  – If variables in different places have the same name, the result may be a false positive
Examples of Checking Failure
Though the following examples are all equivalent, the name-based equivalence checker fails
  Original design:
    int ex1(int a, int b) { return a - b; }
  Swapping the variable names:
    int ex2(int b, int a) { return b - a; }
  Modifying the variable names:
    int ex3(int c, int d) { return c - d; }
  Introducing an intermediate variable:
    int ex4(int a, int b) { int c = a - b; return c; }
Identifying Potential Internal Equivalences
• Perform a random simulation, then identify sets of variables having the same signature (i.e. sequence of simulated values)
  – Well-known technique in RTL verification
    • In RTL, the values of registers are uniquely known at every cycle
  – However, there is no concept of "cycle" in behavioral-level design descriptions
    • The concept of "a variable with context" is introduced
Example:
  Design 1:  e = a - b;  f = a + b;
  Design 2:  x = b + a;  y = a - b;
  Random pattern:  a = (35, -4, 712)   b = (-220, 1151, -3)
  Signatures after simulation:
    e = (255, -1155, 715)   f = (-185, 1147, 709)
    x = (-185, 1147, 709)   y = (255, -1155, 715)
  Potential internal equivalences:  {e, y}  {f, x}
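The signature-based grouping in this example can be reproduced with a short Python sketch. The two designs are encoded by hand as dictionaries of Python functions, which is an assumption for illustration; the actual flow simulates SpecC descriptions. The function names signatures and group_by_signature are hypothetical.

```python
from collections import defaultdict


def signatures(design, patterns):
    """Simulate every variable of a design on all input patterns."""
    return {var: tuple(fn(**p) for p in patterns) for var, fn in design.items()}


def group_by_signature(*sig_maps):
    """Group variables (from any design) that share the same signature."""
    groups = defaultdict(list)
    for sigs in sig_maps:
        for var, sig in sigs.items():
            groups[sig].append(var)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


design1 = {"e": lambda a, b: a - b, "f": lambda a, b: a + b}
design2 = {"x": lambda a, b: b + a, "y": lambda a, b: a - b}
patterns = [{"a": 35, "b": -220}, {"a": -4, "b": 1151}, {"a": 712, "b": -3}]

sig1 = signatures(design1, patterns)
sig2 = signatures(design2, patterns)
print(sig1["e"], sig1["f"])
print(group_by_signature(sig1, sig2))
```

With the three patterns from the slide, e and y share one signature and f and x share another, giving the potential internal equivalences {e, y} and {f, x}. Equal signatures only suggest equivalence, which is why they are called potential.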
Internal Equivalences at Behavioral Level
Definition of a cycle
• A period of the execution from accepting an input pattern to generating a set of output values
• Internal variables must be assigned once in a cycle
  – Designs are in static single assignment (SSA) form
  – The concept of "a variable with context" is introduced for multiple instantiations of modules and function calls
A variable with context
• Context: runtime path information from the top-level module to the current function
• Guaranteed to be assigned only once in a cycle
Method Based on Potential Internal Equivalences
Issue: potential internal equivalences may include false equivalences
• False equivalences may lead to false positive results
• Non-equivalent variables may be identified as equivalent
  – Equivalent variables are always identified as equivalent
  – True equivalences are included in the potential equivalences
Solution: explore all possible subsets of internal equivalences when applying the rules
• Still practical, since each set of internal equivalences is typically small
Implementation
Implemented in C++ on top of our system-level verification framework FLEC
  – Given a SpecC design description as an input, the ExSDG representation is generated by the parser and dependence analyzer
• Potential internal equivalence identifier
  – Generates a SpecC description from the ExSDG representation, as well as a random input pattern generator module
  – Compiles the random simulator using the SpecC reference compiler
• Rule-based equivalence checker
  – Each rule is implemented as a callback function
  – Given two nodes in ExSDGs, the equivalence between the two sub-trees is checked
Case Study: A Practical Design
Example: two IDCT designs before and after parallelization
• Column and row processes are parallelized
• The existing method failed the check, since it could not find the correspondences between variables
• The proposed method checked the equivalence
  – 10 cycles of random simulation
  – Runtime: ~3 seconds
    • Mainly compilation and execution time of the simulator
Debug Hardware
• Event Extractor
  – Extracts the basic required information
    • Transaction type (read/write)
    • Start of transaction
    • End of transaction
    • Initial and target address of transaction
• Trace Buffer
  – Stores extracted transactions in a compact form
[Figure: the transaction-level view (initiator, channel, target; put(req), get(req), put(res), get(res), each returning true/false) is mapped to the signal-level view (master, bus, slave; signals such as m_request, m_select, opb_grant, opb_dbus, s_dbus, opb_xferack, s_xferack).]
Post Silicon Debug with ExSDG
• Focus on the communication parts of designs
• Establish a mapping between ExSDG (for C) and chip traces
[Figure: in the initial system, Module 1 … Module k communicate over a channel (on-chip bus, network, …). The debug HW consists of a transaction extractor and a trace buffer; the debug SW reads out the buffer and analyzes it.]
Debug flow:
1. Extract basic transactions
   – Transaction type (read/write)
   – Start & end of transaction
   – Initial and target of transaction
2. Store the extracted transactions in a trace buffer
   * Run the system until a failure *
3. Read the trace buffer
4. Analyze the traces with software
   – Find wrong behaviors using debug patterns, e.g. potential race, potential deadlock
Post Silicon Debug: some results
• Hardware overhead is low
  – OPB bus: 1966 gates
  – PLB bus: 2206 gates
• Trace buffer is not large
  – Trace buffer fields (1 or 2 bytes per transaction): Master ID, Slave ID, Address, R/W, Command, Tag
• Debug patterns are written as assertions
    assert never
      SoTr(m1, s1, Wr, -, t1) ; SoTr(m2, s1, Wr, SAME, t2) ; EoTr(m1, s1, Wr, SAME, t1)
      | EoTr(m1, s1, Wr, -, -) ; SoTr(m2, s1, Wr, SAME, -)
    filter (*,*,*)
• Analysis is quick: 20 seconds for 100,000 transactions
  – Working with ExSDG (C descriptions)
• Example of a pattern to detect:
  1) Master1 locks the first semaphore
  2) Master2 locks the second semaphore
  3) Master1 waits for the second semaphore
  4) Master2 waits for the first semaphore
  5) Steps 3 and 4 are repeated
Our approach to debugging high-level designs
• Concrete simulation
  – Depth-first search (DFS): long range, narrow width
• Formal method: symbolic state traversal
  – Breadth-first search (BFS): short range, wide width
• Our approach: user-driven search that combines DFS and BFS
  – DFS: to collect reachable states
  – BFS: to search exhaustively around states of interest (user specified)
  – The user switches between DFS and BFS using various commands, including execution path specification
[Figure: three views of the reachable state space with initial states and faulty states (F). DFS follows single long paths from the initial states; BFS covers a wide region close to the initial states; the combined approach follows a DFS path, jumps to a state of interest, and runs BFS around it, avoiding infeasible states.]
Some experiences
• Filter design from a company
  – 170 LoC in SpecC, part of a buggy real design
  – BMC cannot detect the bug (simply too deep)
  – The designer specifies a set of execution paths that he has concerns about
    • Concern: some portions of the code may not work when executed in particular sequences
  – A 61-cycle pattern was successfully generated with the proposed approach
• Elevator controller
  – 9500 LoC in SpecC
  – Target assertion
    • The door must open within 30 cycles after the up/down button is pushed while the elevator is stopped at a floor
  – BMC failed at the 120th cycle (after a run of more than 10 days)
  – From BMC alone, it looks as if there is no such issue
    • User-guided analysis reveals the failure!
Data-path Dominated Applications
[Figure: design flow from a real number specification and floating point model, via automatic fixed-point generation, to a MATLAB model (fixed point); refinement and optimization (polynomial optimization on the HED representation) lead to an RTL model (fixed bit-width); RTL synthesis leads to a gate-level model (bit level), followed by physical design. Equivalence checking with the Modular-HED representation and debugging link the arithmetic level and the bit level.]
Horner Expansion Diagram (HED)
• Horner expansion: $f(x, y, \ldots) = f_{const} + x \cdot f_{linear}$
  – Const (left) child (dashed line)
  – Linear (right) child (solid line)
• Example
  – $f(x,y,z) = x^2 y + xz - 4z + 2$; order: $x > y > z$
  – $f(x,y,z) = [-4z + 2] + x[xy + z] = f_{const} + x \cdot f_{linear}$
    • $f_{const} = -4z + 2 = f(z)$
    • $f_{linear} = f_1(x,y,z) = xy + z$
  – $f_1(x,y,z) = xy + z = z + x[y]$
    • $f_{1,const} = z$; $f_{1,linear} = y$
  – Horner form
    • $f = x(x(y) + z) + z(-4) + 2$
[Figure: HED of f(x,y,z). The root node x has a const child (dashed) for -4z + 2 and a linear child (solid) for xy + z; the nodes for y and z end in constant leaves.]
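The decomposition step of the example can be sketched in Python as follows. This only illustrates the split f = f_const + x · f_linear on a sparse polynomial; it does not build the shared diagram itself. The helper names mono and horner_split and the dictionary encoding of polynomials are hypothetical.

```python
def mono(**exps):
    """Build a monomial key, e.g. mono(x=2, y=1) for x^2*y; mono() is the constant 1."""
    return tuple(sorted(exps.items()))


def horner_split(poly, var):
    """Split poly into (const, linear) such that poly = const + var * linear.

    poly maps monomial keys (see mono) to integer coefficients.
    """
    const, linear = {}, {}
    for key, coeff in poly.items():
        exps = dict(key)
        if exps.get(var, 0) == 0:
            const[key] = coeff           # term does not contain var
        else:
            exps[var] -= 1               # factor one var out of the term
            reduced = tuple(sorted((v, e) for v, e in exps.items() if e > 0))
            linear[reduced] = linear.get(reduced, 0) + coeff
    return const, linear


# f(x,y,z) = x^2*y + x*z - 4*z + 2
f = {mono(x=2, y=1): 1, mono(x=1, z=1): 1, mono(z=1): -4, mono(): 2}
f_const, f_linear = horner_split(f, "x")
f1_const, f1_linear = horner_split(f_linear, "x")
print(f_const, f_linear)
print(f1_const, f1_linear)
```

The first split yields f_const = -4z + 2 and f_linear = xy + z; splitting f_linear again on x yields z and y, matching the example.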
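High-level Polynomial Datapath Verification
Modular equivalence checking
  – Anti-aliasing function: $F = \frac{1}{2\sqrt{a^2 + b^2}} = \frac{1}{2\sqrt{x}}$, with $x = a^2 + b^2$
  – Expanded into a Taylor series:
    $F \approx \frac{1}{64}x^6 - \frac{9}{32}x^5 + \frac{115}{64}x^4 - \frac{75}{16}x^3 + \frac{279}{64}x^2 - \frac{81}{32}x + \frac{85}{64}$
  – Implemented as a fixed size datapath
    • F1[15:0], F2[15:0], x[15:0]
[Figure: datapath computing x = a^2 + b^2 from inputs a and b, followed by a MAC unit with a coefficient table and a register producing F.]
• $F_1 = 156x^6 + 62724x^5 + 17968x^4 + 18661x^3 + 43593x^2 + 40244x + 13281$
• $F_2 = 156x^6 + 5380x^5 + 1584x^4 + 10469x^3 + 27209x^2 + 7456x + 13281$
• $F_1 \neq F_2$ over $Z$
• $F_1[15:0] = F_2[15:0] \bmod 2^{16}$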
Modular-HED (M-HED)
• The Smarandache function S(b) in number theory is defined for a given positive integer b as the smallest positive integer such that its factorial is divisible by b, i.e. $b \mid S(b)!$
  – Example: the number 8 does not divide 1!, 2!, 3!, but does divide 4! (4!/8 = 3), so S(8) = 4
    • 1*2*3*4 % 8 = 0
    • 5*6*7*8 % 8 = 0
    • 100*101*102*103 % 8 = 0
  – A product of 4 consecutive numbers is divisible by 8
    • $x(x+1)(x+2)(x+3) \equiv 0 \bmod 8$
    • $x(x+1)(x+2)(x+3) = x^4 + 6x^3 + 11x^2 + 6x$ can be freely added or subtracted under mod 8!
    • Coefficients are modified, transforming the given polynomials to normal forms
Modular-HED (M-HED)
• Vanishing polynomial: $g(x) \equiv 0 \bmod n$
• If we can factorize a polynomial g(x) into a product of S(n) consecutive numbers, $\prod_{i=1}^{S(n)} (x+i)$, then it can be reduced to 0 over $Z_n$
  – $S(2^4) = 6$
• $f(x) = (x+1)(x+2)(x+3)(x+4)(x+5)(x+6) \equiv 0 \bmod 16$
  – $f(x) = x^6 + 21x^5 + 175x^4 + 735x^3 + 1624x^2 + 1764x + 720$ in $Z_{2^4}$
  – It can be reduced to ZERO!
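The two facts used on these slides, the values of S(n) and the vanishing of the example polynomials, can be checked numerically with a short Python sketch. The function names smarandache and vanishes are hypothetical; the vanishing check simply evaluates the polynomial at every residue modulo n, which is feasible only for small n and is not how M-HED normalizes polynomials.

```python
def smarandache(b):
    """Smallest positive k such that k! is divisible by b."""
    k, factorial = 1, 1
    while factorial % b != 0:
        k += 1
        factorial *= k
    return k


def vanishes(coeffs, n):
    """True if the polynomial (coefficients, highest degree first)
    evaluates to 0 mod n for every integer x."""
    def evaluate(x):
        acc = 0
        for c in coeffs:          # Horner evaluation modulo n
            acc = (acc * x + c) % n
        return acc
    # A polynomial function mod n is determined by the residues 0..n-1.
    return all(evaluate(x) == 0 for x in range(n))


print(smarandache(8), smarandache(16))
# x^4 + 6x^3 + 11x^2 + 6x  =  x(x+1)(x+2)(x+3)
print(vanishes([1, 6, 11, 6, 0], 8))
# x^6 + 21x^5 + 175x^4 + 735x^3 + 1624x^2 + 1764x + 720  =  (x+1)...(x+6)
print(vanishes([1, 21, 175, 735, 1624, 1764, 720], 16))
```

The checks confirm S(8) = 4 and S(16) = 6, and that both products of consecutive numbers vanish modulo 8 and 16 respectively.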
Preliminary Experimental Results
  Benchmark | Specs (Var/Deg/n) | Modular-HED (Node/Time) | Method [9] (Time (s)) | CUDD-BDD (Node/Time) | *BMD (Node/Time) | miniSat [13] (Vars/Clauses/Time) | MILP [12] (Time (s))
  AAF | 1 / 6 / 16 | 8 / 0.016 | 6.81 | 1.1M / 32.2 | NA / >500 | 3.9K / 107K / >500 | >500
  D4F | 1 / 4 / 16 | 6 / 0.031 | 4.95 | 27M / 20.3 | NA / >1000 | 25K / 76K / >1000 | >1000
  CHEB | 1 / 5 / 16 | 7 / 0.01 | 5.95 | 1M / 26.9 | NA / >500 | 3.5K / 86K / >500 | >500
  PSK | 2 / 4 / 16 | 16 / 0.032 | 13.48 | NA / >500 | NA / >500 | 52K / 142K / >500 | >500
  DIRU | 2 / 4 / 16 | 9 / 0.016 | 14.4 | NA / >1000 | NA / >1000 | 10K / 30K / >1000 | >1000
  MI | 2 / 9 / 16 | 26 / 0.2 | 17.5 | 23M / 39.4 | NA / >1000 | 24K / 69K / >1000 | >1000
  SG | 5 / 3 / 16 | 35 / 0.24 | 6.1 | NA / >1000 | NA / >1000 | 64K / 190K / >1000 | >1000
  QS | 7 / 4 / 16 | 19 / 0.09 | 32.4 | NA / >1000 | NA / >1000 | 76K / 211K / >1000 | >1000
NA: Not Applicable; K: Thousand; M: Million
Boolean reasoning methods do not work on these benchmarks; word-level techniques such as the proposed one are needed
High-level Polynomial Datapath Optimization
A partitioning and compensation heuristic
• Given a polynomial, how can we optimize it in terms of the number of adders and multipliers at a fixed bit-width?
• Our solution
  – Apply Modular-HED
    • To reduce over $Z_{2^n}$
  – Partitioning approach
    • Poly = p1*p2 + p3
    • Minimize p3
  – Compensation approach
    • Compute coefficients
[Figure: flow Modular-HED → Partitioning → Compensation.]
• poly = p1p2+p3 with unknown coefficients
• Minimize the cost of p3 poly = w4 – x3 –x2y -5x2 + 2xy +2y2 + xz + yz + 14y
+5z +3
• After Partitioningp1 = a1x2 + a2y + a3z;
p2 = b1x + b2y +b3;
p3 = w4 + 3
poly = p1*p2 + p3
Partitioning
poly
p1, p2, p3
• In our example poly = w4 – x3 –x2y -5x2 + 2xy +2y2 + xz + yz + 14y +5z +3p1 = a1x2 + a2y + a3z p2 = b1x + b2y +b3 p3 = w4 +3
• Set the coefficients (ai,bj) in order to achieve the minimum cost p3
• First consider all the equations produced by p1*p2 = poly - p3– a1b1=-1, a1b2=-1,a1b3=-5,a2b1=2,a2b2=2,a3b1=1, a3b2=1,
a2b3=14, a3b3=5– These equations may not have an answer!
Compensation Heuristic
Preliminary Experimental Results
  Application | Function | M/V/D/n | Horner (#Gate / Delay / Time) | Enhanced CSE (#Gate / Delay / Time) | Our Approach (#Gate / Delay / Time)
  Graphics | Cosine-Wavelet | 9/2/3/16 | 7850 / 23.7 / 0.04 | 5109 / 18.3 / 3.47 | 3678 / 18.5 / 6.52
  Image Processing | Savitzky-Golay 1 | 10/2/3/16 | 7218 / 20.7 / 0.04 | 3757 / 22.7 / 4.75 | 2879 / 17.8 / 14.9
  Image Processing | Savitzky-Golay 2 | 5/2/2/16 | 1697 / 20.8 / 0.03 | 1433 / 18.1 / 0.48 | 1057 / 16.2 / 1.63
  Filter | Quad 1 | 5/2/2/16 | 2737 / 17.7 / 0.03 | 2269 / 16.4 / 0.63 | 1763 / 17.1 / 1.78
  Filter | Quad 2 | 5/2/2/16 | 2569 / 16.4 / 0.03 | 2032 / 16.4 / 0.63 | 1571 / 15.3 / 1.56
  Automotive | Mibench | 9/3/2/16 | 2058 / 15.7 / 0.04 | 2046 / 15.6 / 0.46 | 1303 / 14.1 / 8.75
  Average saving w.r.t. Horner | | | 0% / 0% / - | 31% / 6.3% / - | 49.2% / 13.8% / -
M: the number of monomials; V: the number of variables; D: the maximum degree; n: the number of bits (word-length)
Intensive optimization of polynomials greatly reduces the size of the generated custom circuits/software
Bit Level Adder (BLA) Model
• In an arithmetic circuit, several addition processes are possible
• However, each addition can be represented with the BLA model
• Represent $A + 2^m B$ by some custom full-adders and half-adders, which represent two cascaded XORs and one XOR, respectively
  – Realization of the carry signals is not uniquely modeled
• BLA is robust for
  – Carry-look-ahead
  – Ripple-carry
  – Carry-select
  – Carry-skip adders
The Proposed Debugging Algorithm
• Partial product initialization
  – Extract each column of $2^i$ from S
  – Provide a bit-level ADD_SET with different bit-orders
  – Bits in each column must be added with each other, while the generated carries are sent to the higher-order column
• Column-based XOR extraction
  – Search for XORs over the initial partial product terms of each ADD_SET column
  – At least one input comes from column $2^i$
  – XOR extraction is performed without carry-logic blocks
    • Much faster than other XOR extraction techniques
[Figure: an adder in a column with partial product inputs P1(2), P2(2), P3(2) and an unknown (carry) input.]
The Proposed Debugging Algorithm
• Carry-signal mapping using HED
  – For each FA with no unknown input, build a new reference (in HED)
  – For each HA/FA with an unknown input Cm
    • Find the backward logic of Cm
    • Map Cm to the HED of the most similar reserved carry from the previous column
[Figure: full adders (FA) in a 4-bit multiplier.]
Experimental Setup and Results
• F1 = A*B; F2 = C*D+3E+120; F3 = A*B+C*D
  – A, B: 32-bit; C, D, E, F1: 64-bit; F2, F3: 129-bit
The method can deal with large (64-bit, 128-bit) arithmetic functions for efficient verification and debugging
Communication parts can be very complicated
• An interface between protocol A and protocol B is needed
• Protocols can be very complicated
  – Over 30 different commands are defined in state-of-the-art protocols (OCP)
  – Manuals run to over 200 pages
  – Burst, out-of-order modes, …
• We have developed an automatic generator of protocol transducers
  – Protocols are formally defined with FSM/automaton
  – State-of-the-art protocols can be dealt with in a couple of minutes
  – Now being extended to:
    • On/off-chip bus interconnections
    • Formal verification of interfaces
[Figure: an SoC in which CPU (IP), DMAC (IP), and RAM (IP) use protocol A, while MPEG, RAM, custom HW, and DMAC use protocol B; a transducer connects the two protocols.]
How protocol transducer is realized
• Intuitive understanding of the problem– Follow the two protocols compute the product of ⇒
the two FSM/automata
ProtocolA
Master
ProtocolB
Slave
Request Request
Response Response
設計対象
Exploration[1] + ours
Definition of protocol
Protocol A
Protocol BProtocol transducerIn FSM/automaton
(stb==1)ack<=0
ack<=1
ack<=0Clock-wisebehavior
[1] R. Passerone, J. A. Rowson, A. Sangiovanni-Vincentelli, "Automatic Transducer Synthesis of Interfaces between Incompatible Protocols," DAC'98, pp. 8-13.
Simple computation of products
• A protocol definition automaton should not contain any loops
– Even within the same state, data values differ
– The expansion can be infinite
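The no-loop requirement can be checked before a product is computed. The following is a minimal sketch using depth-first search; the function name `has_loop` and the `{state: [successor, ...]}` encoding are assumptions for illustration only.

```python
def has_loop(transitions):
    """Return True if the automaton {state: [successor, ...]} contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current DFS path, finished
    color = {}

    def visit(state):
        color[state] = GRAY
        for nxt in transitions.get(state, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:                      # back edge: a loop exists
                return True
            if c == WHITE and visit(nxt):
                return True
        color[state] = BLACK
        return False

    return any(color.get(s, WHITE) == WHITE and visit(s) for s in list(transitions))

acyclic = {"A": ["B"], "B": ["C"], "C": []}
looping = {"i": ["A"], "A": ["B"], "B": ["i"]}   # returns to the initial state
```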
[Figure: product of automaton A → B → C and automaton D → E → F; product states shown include AD, AE, BD, BE, BF, CD, CE, CF]
[Figure: a Protocol A master and a Protocol B slave, each with ctrl and 8-bit data signals, connected through a transducer. Example signal sequences: {Ctrl=0}, {Ctrl=1, data1}, {Ctrl=0, data2}. Path annotations: invalid due to illegal dependency; unavoidable path; some paths are avoidable; minimum latency path]
Need separation between computation and communication!
Separation of computation and communication inside protocol transducers
• In the protocol definition, control and data are specified separately
• Introduce two FSMs, one for requests and one for responses, to describe complicated protocols uniformly
• The FIFO can be made arbitrarily complicated if we like
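The separation can be pictured with a toy model in which the control side only decides when data moves, while a FIFO holds the data itself. The class `Transducer`, its method names, and the single-FIFO structure are assumptions for this sketch, not the generated hardware.

```python
from collections import deque

class Transducer:
    """Toy transducer: control decides *when* to move data, a FIFO holds *what*."""

    def __init__(self):
        self.fifo = deque()        # data path, independent of the control state
        self.req_state = "IDLE"    # request-side control FSM (master side)
        self.sent = []             # words delivered to the slave

    def master_cycle(self, ctrl, data=None):
        """One clock on the master side: ctrl=1 marks a valid data word."""
        if ctrl == 1:
            self.fifo.append(data)         # control only triggers the write
            self.req_state = "RECEIVING"
        else:
            self.req_state = "IDLE"

    def slave_cycle(self, slave_ready):
        """One clock on the slave side: forward one word when the slave accepts it."""
        if slave_ready and self.fifo:
            self.sent.append(self.fifo.popleft())

t = Transducer()
t.master_cycle(1, 0x11)
t.master_cycle(1, 0x22)
t.slave_cycle(slave_ready=False)   # slave stalls; data waits in the FIFO
t.master_cycle(0)
t.slave_cycle(slave_ready=True)
t.slave_cycle(slave_ready=True)
```

Because data values never appear in the control states, the control automata stay small regardless of data width.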
[Figure: transducer between a Protocol A master and a Protocol B slave, shown first with a Res. FSM and then with both a Req. FSM and a Res. FSM on the request (Req.) and response (Res.) channels. Even arithmetic computation is possible inside the transducer]
Protocols can be very complicated
• State-of-the-art protocols introduce many features for higher throughput
[Figure: timing diagrams over time t between a protocol master and a protocol slave, with a Request (address/data) channel and a Response (data) channel]
– Blocking (low throughput): Req1 → Res1, Req2 → Res2
– Split transaction (non-blocking): Req1, Req2, Req3 are issued without waiting for Res1, Res2
– Out-of-order transaction: Req1, Req2, Req3 are answered as Res1, Res3, Res2
– Burst transaction: Addr1 to Addr4 with Data1 to Data4
– Single-address burst transaction: Addr1 followed by Data1 to Data4
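To make the out-of-order case concrete, the sketch below pairs tagged responses with outstanding requests. The tag-based matching and the function `match_responses` are assumptions for illustration; real protocols define their own tag or ID fields.

```python
def match_responses(requests, responses):
    """Pair tagged responses with outstanding requests, in response arrival order.

    requests:  list of (tag, address)
    responses: list of (tag, data), possibly out of order
    """
    outstanding = dict(requests)            # tag -> address
    completed = []
    for tag, data in responses:
        address = outstanding.pop(tag)      # KeyError means an unexpected tag
        completed.append((tag, address, data))
    return completed, outstanding

# Three requests are issued; responses for tags 1 and 3 arrive first.
completed, pending = match_responses(
    [(1, 0x100), (2, 0x104), (3, 0x108)],
    [(1, "d1"), (3, "d3")],
)
```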
For more complicated protocols…
[Figure: input automata for Protocol A and Protocol B, each with Req. and Res. parts. The transducer between the Protocol A master and the Protocol B slave contains a Req. FSM, a Send FSM, and a Recv FSM, plus a newly introduced FIFO whose write (FIFO WR) and read (FIFO RD) are controlled by the FSMs]
Pros: Can deal with more complicated protocols
Cons: Higher latency due to multiple FIFOs
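Now we can resolve it
• Elimination of loops (to initial states)
• Elimination of intermediate loops
[Figure: two automata, each with initial state i and states A, B and C, D, whose loops back to i are redirected to an ending state e before exploration. A second automaton with states X, Y, Z, U, W contains an intermediate loop; the steps shown are introduction of an ending state, exploration, and elimination of ending states]
SS = Loops are replaced with super states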
[2] S. Watanabe, K. Seto, Y. Ishikawa, S. Komatsu, M. Fujita, "Protocol Transducer Synthesis using Divide and Conquer Approach," Proc. of the 12th Asia and South Pacific Design Automation Conference, pp. 280-285, 2007.
• Concentrating on controls only
– Data parts are processed separately!
[2]
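One way to picture "loops are replaced with super states" is to collapse every strongly connected component of the control automaton into a single node, which leaves a loop-free graph. Whether the tool proceeds exactly this way is not stated on the slide; the sketch below uses Tarjan's algorithm, and `collapse_loops` is a hypothetical name.

```python
def collapse_loops(transitions):
    """Collapse every loop (strongly connected component) into one super state.

    transitions: {state: [successor, ...]}
    Returns (super_of, dag): state -> frozenset super state, and loop-free edges.
    """
    index, low, on_stack, stack = {}, {}, set(), []
    super_of = {}

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in transitions.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:               # v is the root of a component
            members = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                members.append(w)
                if w == v:
                    break
            component = frozenset(members)
            for w in members:
                super_of[w] = component

    for v in list(transitions):
        if v not in index:
            visit(v)

    dag = {}
    for v, successors in transitions.items():
        for w in successors:
            if super_of[v] != super_of[w]:   # drop edges inside a super state
                dag.setdefault(super_of[v], set()).add(super_of[w])
    return super_of, dag

# Intermediate loop between Y and Z (state names are illustrative).
example = {"i": ["X"], "X": ["Y"], "Y": ["Z"], "Z": ["Y", "W"], "W": []}
super_of, dag = collapse_loops(example)
```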
How to deal with multiple complicated transactions
• A protocol is a collection of sequences
• Each sequence can operate independently
– True for state-of-the-art protocols with separation between computation and communication
[Figure: a protocol consists of a hardware definition (ports, signal names, etc.) and sequences: Sequence1 (read), Sequence2 (write), Sequence3 (4-burst read), Sequence4 (4-burst write), ... Each sequence has Automaton1 (for request or blocking) and Automaton2 (for response), e.g. from initial state i: (stb==1) ack<=0; ack<=1; ack<=0. All sequences share the initial state]
[2]
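The protocol structure described on this slide could be represented roughly as follows. All class and field names here are hypothetical; the tool's actual input is a hierarchical automaton description in XML whose schema is not shown in these slides.

```python
from dataclasses import dataclass, field

@dataclass
class Automaton:
    initial: str
    transitions: dict   # state -> list of (guard, action, next_state)

@dataclass
class Sequence:
    name: str
    request: Automaton    # "Automaton1": request (or blocking) part
    response: Automaton   # "Automaton2": response part

@dataclass
class Protocol:
    name: str
    signals: dict                                  # hardware definition: signal -> width
    sequences: list = field(default_factory=list)

    def shares_initial_state(self):
        """Check that all request automata start from one common initial state."""
        return len({s.request.initial for s in self.sequences}) <= 1

# Illustrative handshake: wait for stb, then pulse ack for one clock.
handshake = Automaton("i", {
    "i":  [("stb==1", "ack<=0", "s1")],
    "s1": [("", "ack<=1", "s2")],
    "s2": [("", "ack<=0", "end")],
})
proto = Protocol(
    "A",
    {"stb": 1, "ack": 1},
    [Sequence("read", handshake, Automaton("i", {})),
     Sequence("write", handshake, Automaton("i", {}))],
)
```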
Hierarchical synthesis owing to comp. and comm. separation
[Figure: sequences A1 and B1 of Protocol A and Protocol B are combined by exploration into partial transducer 1, and sequences A2 and B2 into partial transducer 2. The partial transducers, which share the initial state i, are added together to form the transducer]
Merge the generated FSMs that share the same initial state
Sequence-level synthesis followed by a merge process
[2]
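The merge step can be sketched as a union of partial FSMs that keeps the shared initial state and renames all other states to avoid clashes. The encoding and the function `merge_fsms` are assumptions for this example, not the tool's implementation.

```python
def merge_fsms(partials, initial="i"):
    """Merge partial FSMs that share `initial` into one FSM.

    Each partial FSM is {state: [(label, next_state), ...]}. States other than
    the shared initial state are prefixed so that different partials never clash.
    """
    merged = {initial: []}
    for k, fsm in enumerate(partials):
        def rename(state, k=k):
            return state if state == initial else f"p{k}_{state}"
        for state, edges in fsm.items():
            merged.setdefault(rename(state), []).extend(
                (label, rename(nxt)) for label, nxt in edges
            )
    return merged

# Two partial transducers synthesized per sequence (labels are illustrative).
read_part = {"i": [("read_req", "r1")], "r1": [("read_res", "i")]}
write_part = {"i": [("write_req", "w1")], "w1": [("write_res", "i")]}
merged = merge_fsms([read_part, write_part])
```

Synthesizing each sequence separately keeps every exploration small; only the final union touches the whole transducer.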
Tool implementation
• Planned to be distributed freely from OCP-IP
Experimental results
• Athlon 64 2 GHz + 1 GB RAM
• Implemented as over 12,000 lines of code in C++
– Input: Hierarchical automaton descriptions in XML
– Output: RTL synthesizable Verilog
• Logic synthesis: Xilinx ISE
• RTL simulator: ModelSim XE
Master's Protocol   Slave's Protocol   Type        Sequences   Synth. Time [s]   Gate counts
OCP                 AHB                (NB,BK)     4           1.1               2,352
AHB                 OCP                (BK,NB)     4           1.3               1,843
OCP                 OCP                (NB,NB)     2           1.9               1,568
OCP                 Tagged OCP         (NB,OoO)    2           2.2               3,514
Tagged OCP          AXI                (OoO,OoO)   2           4.8               1,377
AXI                 OCP                (OoO,NB)    2           4.9               1,731
OCP                 AXI                (NB,OoO)    26          257.8             61,205
NB: non-blocking, BK: blocking, OoO: out-of-order