
Verification and debugging of hardware designs utilizing C-based high-level design descriptions

Masahiro Fujita

VLSI Design and Education Center (VDEC)

University of Tokyo

System level design flow

[Figure: system-level design flow ("Design as a Meeting Point" between ideas and components). Stages: conception, functional decomposition (function-separated description), architectural exploration with performance analysis and HW/SW partitioning, structural decomposition, high-level synthesis, RTL-to-RTL optimization, RTL synthesis. Abstraction levels: C/HL, C/TLM (high-level/transactional), RTL. Steps are automatic or manual; component libraries feed the flow.]

• Typical design flow used in industry
• Mostly C-language-based design descriptions for embedded system and SoC designs

(Electric) System level design tools: Elegant

[Figure: design steps supported by the tool. System-level design: algorithm design, HW/SW partitioning (functions A, B, C, D are assigned to HW or SW), communication network design (protocols between the SW and HW sides), and interface design (interface circuits, processor model, device driver). These are followed by hardware implementation design (RTL) and software implementation design. Annotation: "Traditional design starts here…"]

SystemC/SpecC

• Joint development by JAXA, Toshiba, NEC, Fujitsu, UC Irvine, and U. of Tokyo (formal verification)

• Helps assign functions onto components

• Static/model checking, equivalence checking

JAXA: Japan's space exploration agency. The Elegant tool has actually been used for space satellites.

Outline

• Overview of our verification research activities on C-based high-level design descriptions
  – Based on the Extended System Dependence Graph (ExSDG)
• Equivalence checking as well as static/model checking with ExSDG
• Post-silicon verification method through mapping chip traces to ExSDG
• Verification/debugging methods for large arithmetic circuits
• Automatic generation of on-chip bus protocol transducers

System Dependence Graph

int i = 1;
main() {
    int a = 0;
    int b = 0;
    a = a + i;
    if (a == 1)
        b = a++;
}

[Figure: dependency analysis turns the program above into a System Dependence Graph (SDG). Nodes: main; i = 1; a = 0; b = 0; a = a + i; if (a == 1); b = a++; a++; and the declarations int i, int a, int b. Edge types: control dependence, data dependence, declaration dependence.]

• Sufficient representation, since verification methods for netlists are applicable to SDGs
• Problems
  – Existing SDGs are too complicated (the numbers of nodes and edges are huge)
  – They do not correspond directly to abstract syntax trees

Extended Syntaxes in ExSDG from C

Hierarchical structure: a module contains sub-modules (Sub Module 1, Sub Module 2) connected through ports.

Concurrency:
par { a.main(); b.main(); c.main(); }

Synchronization:
... wait(a); ... notify(b); ...
... notify(a); ... wait(b); ...

Bit vector:
bit a[3:0];
bit b[3];
a[2:0]@b[1]

Timed behavior:
buffered bit[1] a;
a = 1; waitfor(1); waitfor(1); a = 0; waitfor(1);

[Figure: waveform of a against clk.]

Translation Flow to ExSDG

[Figure: SpecC, SystemC, SystemVerilog, Verilog, and VHDL designs, written at the system/behavior level or at RTL, are parsed into ASTs at one of three levels, AST(UBL), AST(TBL), or AST(RTL), then simplified and translated to ExSDG.]

• Untimed Behavior Level (UBL): no notion of time; for software design or behavior specification
• Timed Behavior Level (TBL): includes timing specification; for design with timing estimation
• Register Transfer Level (RTL): cycle accurate; for hardware design
• Simplification: complex or redundant syntaxes are removed (e.g. the switch statement)

Three Design Stages

Untimed Behavior Level:
module TEST(int in1, int out1, event e) {
    Proc p1();
    Proc p2();
    void main() {
        par { p1.main(); p2.main(); }
        wait(e);
    }
};
• Concurrency and synchronization
• Timed expressions and buffered/wire variables are prohibited

Timed Behavior Level:
module TEST(int in1, int out1, event e) {
    Proc p1();
    Proc p2();
    void main() {
        par { p1.main(); p2.main(); }
        waitfor(1);
        wait(e);
    }
};
• Untimed Behavior Level + timed expressions
• buffered/wire variables are prohibited

Register Transfer Level:
module TEST(wire int in1, wire int out1) {
    buffered int a;
    void init() { out1 = a; }
    void run_one_cycle() { a = in1; }
    main() {}
};
• par and wait/notify are prohibited
• init(): wire connections are declared
• run_one_cycle(): executed per clock

Edges in ExSDG
• Control Flow Edge
• Data Dependence Edge
• Control Dependence Edge
• Declaration Dependence Edge
• Parameter In/Out
• Summary Edge (Interprocedural Dependence Edge)
• Parallel Edge
• Communication Edge
• Port Reference Edge

Concurrent and Synchronization Dependence

int x, y;
event e1, e2;
void main(void) {
    y = 0;
    x = 5;
    par { func1(); func2(); }
}

void func1() {      /* Process 1 */
    y = x;
    notify(e1);
    wait(e2);
    x = y;
}

void func2() {      /* Process 2 */
    wait(e1);
    y++;
    y = y * y;
    notify(e2);
}

[Figure: ExSDG fragment. Parallel edges connect the par node to func1() (Process 1) and func2() (Process 2), up to the end of par. Communication edges connect notify(e1) to wait(e1) and notify(e2) to wait(e2).]

Hierarchical Dependence

module M1(int in, int out, event e1) {
    void main(void) {
        wait(e1);
        out = in * in;
    }
};

module M2(int in, int out, event e1) {
    void main(void) {
        out = in + in;
        notify(e1);
    }
};

module M3(int inout) {
    int w1;
    event e1;
    M1 m1(w1, inout, e1);
    M2 m2(inout, w1, e1);
    void main(void) {
        par { m1.main(); m2.main(); }
    }
};

[Figure: port reference edges connect the declarations of M3 (int M3::w1; event M3::e1; int M3::inout;) to the ports of M1 (int M1::in; int M1::out; event M1::e1;) and of M2 (int M2::in; int M2::out; event M2::e1;).]

Experimental Results
• Compared with a SpecC SDG generation method using CodeSurfer
• Performed on a PC with a 3.2 GHz Xeon and 2 GB memory

Number of nodes and edges in the IDCT example

                              | # of nodes | # of edges
ExSDG                         | 453        | 2061
SDG generated by CodeSurfer   | 6380       | 48073

SDG generation time

Example  | NoL  | CodeSurfer | ExSDG
IDCT     | 420  | 8.7 s      | 5.2 s
Elevator | 3055 | 9.1 s      | 222.2 s
MPEG4    | 5657 | 30.4 s     | 902.9 s








[Figure: refinement flow. An algorithmic design description is refined through many steps of manual refinement into a high-level synthesizable description, which high-level synthesis turns into a register-transfer-level description. Design optimization is applied at each level. Static/model checking is applied to each description and sequential equivalence checking between descriptions.]

From the viewpoint of verification
• Keep entire descriptions as correct as possible
• Static/model checking of each description
• Equivalence checking between two descriptions
• Besides simulation, formal methods should be applied

Basic Procedure
• Symbolic simulation
  – Generates a set of equations from designs
  – Every variable/operation is an uninterpreted symbol
  – Every expression is a formula of symbols
• Equivalence Class (EqvClass)
  – A class containing all equivalent expressions/variables
  – Generated during symbolic simulation based on
    • Assignment statements
    • Substitution with equivalent expressions/variables
• Solving with SMT solvers
  – If the generated equations have to be solved to prove the desired equivalence, we use SMT solvers to interpret arithmetic
  – Public ones and our own solvers

Example

Description 1:
a = v1;
b = v2;
add1 = a + b;

Description 2:
add2 = v1 + v2;

4 EqvClasses are generated from the 4 assignments:
E1 (a, v1)
E2 (b, v2)
E3 (add1, a+b)
E4 (add2, v1+v2)

E3 and E4 can be merged by substituting a with v1 and b with v2:
E1 (a, v1)
E2 (b, v2)
E3' (add1, a+b, add2, v1+v2)
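The merging of equivalence classes in this example can be sketched in a few lines. This is only an illustrative model, not the FLEC implementation: the class name `EqvClasses` and its methods are hypothetical, expressions are plain tuples, and each variable is assumed to be assigned once.

```python
class EqvClasses:
    """Union-find over variables and expression tuples (a simplified EqvClass store)."""

    def __init__(self):
        self.parent = {}

    def find(self, e):
        # Representative of the class containing e (with path halving).
        self.parent.setdefault(e, e)
        while self.parent[e] != e:
            self.parent[e] = self.parent[self.parent[e]]
            e = self.parent[e]
        return e

    def canon(self, expr):
        # Substitute every sub-expression by its class representative, bottom-up.
        if isinstance(expr, tuple):
            op, *args = expr
            expr = (op, *(self.canon(a) for a in args))
        return self.find(expr)

    def assign(self, lhs, rhs):
        # The assigned variable joins the class of the canonicalized right-hand side.
        self.parent[self.find(lhs)] = self.canon(rhs)


ec = EqvClasses()
ec.assign("a", "v1")
ec.assign("b", "v2")
ec.assign("add1", ("+", "a", "b"))      # Description 1
ec.assign("add2", ("+", "v1", "v2"))    # Description 2
print(ec.find("add1") == ec.find("add2"))
```

Because `a + b` is rewritten to `v1 + v2` before it is looked up, `add1` and `add2` land in the same class without any solver call. Laws such as commutativity are deliberately not modeled here.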

Representing Symbolic Expressions: Maximally Shared Graph

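x0 = a + b;
if x0 < 0 then
    x1 = -x0; y1 = y0;
else
    x1 = x0; y1 = y0 + x0;
assert(0 <= y1);

[Figure: maximally shared graph for the code above. Leaves a, b, y0, and 0; operator nodes +, –, >, <=; ITE nodes with cond/true/false inputs defining x1 and y1; the root is the verification condition (VC) 0 <= y1. The node x0 = a + b is shared by all its uses.]

• Linear-sized representation
  – Mathematically equivalent to the standard logical representation
• Advantages
  – Structure is explicit (the flow of data in the graph corresponds to the flow of data in the program)
  – Simple slicing
• No structural redundancies
• Not functionally canonical
  – Practical trade-off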
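A common way to obtain a graph with no structural redundancy is hash-consing: a node is created only if no node with the same operator and children exists yet. The sketch below applies this to the example code. The `SharedGraph` class and the operator names are assumptions made for illustration, not the representation used in the tool.

```python
class SharedGraph:
    """Expression DAG in which structurally identical nodes are stored once."""

    def __init__(self):
        self.table = {}   # (op, child ids) -> node id
        self.nodes = []   # node id -> (op, child ids)

    def mk(self, op, *children):
        key = (op, children)
        if key not in self.table:          # create the node only if it is new
            self.table[key] = len(self.nodes)
            self.nodes.append(key)
        return self.table[key]


g = SharedGraph()
a, b, y0, zero = (g.mk(v) for v in ("a", "b", "y0", "0"))
x0 = g.mk("+", a, b)
cond = g.mk("<", x0, zero)
x1 = g.mk("ITE", cond, g.mk("neg", x0), x0)
y1 = g.mk("ITE", cond, y0, g.mk("+", y0, x0))
vc = g.mk("<=", zero, y1)                  # verification condition 0 <= y1
print(len(g.nodes))
```

Every use of `x0` points to the same node, so the graph grows linearly with the program. Two functionally equal but structurally different expressions still get different nodes, which is the "not functionally canonical" trade-off.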


Problem and Our Approach
• Symbolic simulation cannot be applied to a whole large design
  – Because of the path explosion problem
  – Approach: localize the areas that are symbolically simulated, utilizing the differences between the descriptions

[Figure: descriptions shown side by side, each ending in "return a;", with the differing statements marked as differences.]

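Equivalence Checking Flow

Inputs: two sequential descriptions, desc1 and desc2.
1. Identify the differences between desc1 and desc2.
2. Is there any difference left? If no, report "Equivalent".
3. If yes, decide the initial verification area and its inputs/outputs.
4. Verify the area. If it is equivalent, go back to step 2.
5. If it is not equivalent, check whether the area has reached the primary inputs/outputs. If yes, report "Not equivalent".
6. If no, extend the area, decide its inputs/outputs, and verify again (step 4).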

Verification Area and its Input/Output
• Verification area: a set of statements
• Input variables
  – Used in the area, and assigned outside
• Output variables
  – Assigned in the area, and used outside
• Verification problem for the area
  – Are all pairs of output variables with the same name equivalent?
  – Using the proved equivalences of the input variables
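The input/output definitions above translate directly into set operations. The following sketch assumes straight-line code in which each variable is assigned once; the statement encoding (left-hand side plus the set of variables read) and the function name `area_io` are hypothetical.

```python
def area_io(stmts, area, primary_outputs):
    """Return (inputs, outputs) of a verification area given as a set of statement indices."""
    assigned_in = {stmts[i][0] for i in area}
    used_in = {v for i in area for v in stmts[i][1]}
    used_out = {v for i, (_, uses) in enumerate(stmts) if i not in area for v in uses}
    inputs = used_in - assigned_in                        # used in the area, assigned outside
    outputs = assigned_in & (used_out | set(primary_outputs))  # assigned inside, used outside
    return inputs, outputs


# (assigned variable, variables read) for each statement
stmts = [
    ("a0", {"in1"}),
    ("b0", {"a0", "in2"}),
    ("c0", set()),            # c0 = 0
    ("a1", {"b0", "in3"}),    # the statement that differs between the designs
    ("out0", {"a1", "c0"}),
]
print(area_io(stmts, {3}, {"out0"}))
print(area_io(stmts, {3, 4}, {"out0"}))
```

Primary outputs are treated as "used outside" so that an area containing the last statement still has an output to compare.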

Example

Description 1:
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + in3;          (diff)
out0 = a1 * c0;

Description 2:
a0 = in1;
b0 = a0 + in2;
c0 = 0;
a1 = b0 + 2 * in3;      (diff)
out0 = a1 * c0;

(※) in1, in2, in3 are the primary inputs of both descriptions

• Initial verification
  – Input … b0, in3
  – Output … a1
  – Result … Equivalence cannot be proved
  – Next: forward extension from a1
• 2nd verification
  – Input … b0, c0, in3
  – Output … out0
  – Result … Equivalence cannot be proved
  – Next: backward extensions from all inputs (forward extension cannot be applied any more)
• 3rd verification
  – Input … a0, in2, c0 (= 0), in3
  – Output … out0
  – Result … Equivalent (due to c0 = 0)

Though the differing parts are not equivalent, the primary output is equivalent.

Implementation
• FLEC: A Framework for Verification, Debugging Support, and Static Checking
  – Several engines developed by ourselves
    • Symbolic simulator, difference extraction, input pattern generation, static deadlock checker, slicing, …
  – Dependence-graph-based internal representation of system-level designs
    • A number of APIs are provided to help the development of engines
• ExSDG (Extended System Dependence Graph)
  – Designs are represented in the form of ExSDG in FLEC
  – A frontend from SpecC to ExSDG has already been developed

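FLEC Structure

[Figure: two SpecC designs (Design1 and Design2, .sc files) go through ExSDG generation to produce ExSDG files (.fls). A verification procedure (applied methods and their order) controls the engines: sequentialization of parallel behaviors, difference extraction, the symbolic simulator, the rule-based engine, and SMT solvers. The engines use the equivalence specification and the equivalence classes, and produce the result (equivalence checked).]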

Sequential equivalence checking
• Definition of equivalence
  – Intention of designers, management of differences
• For efficient checking:
  – Identification of matching states
  – Bounded equivalence checking between matching states
  – Identification of equivalent internal points
• Use of SMT solvers

[Figure: spec and impl are driven by the same inputs, resets, and clocks through alignment blocks, and their outputs are compared. Verification returns either "falsified" with a counterexample or "proven" with a bounded or full proof.]

Transactions : State View

• Encapsulates one or more units of computation for the design being verified

• Self-contained since it brings the machine back to a synchronizing state

[Figure: refinement mapping between an SL transaction in the SLM and the corresponding RTL transaction in the RTL.]

Transaction : Memory

• Design 1 transaction: a single memory read/write occurring in a single cycle

• Design 2 transaction: a single memory read/write (potentially) happening over multiple cycles

[Figure: Design 1 is a memory (Mem) with ADDR, DATA, RD, and WR inputs and an OUT output. Design 2 has the same interface but contains a memory, a cache, and cache control.]

Transaction-Based Equivalence Checking

• The states in RTL that correspond to states in the system-level model (SLM) are referred to as synchronizing states

• Based on this, definitions of equivalence are generated manually

• The total equivalence is based on induction over synchronizing states

[Figure: refinement mapping between SLM states and RTL synchronizing states, with transient RTL states in between. Outcome: complete verification or a sequential counterexample.]

Equivalence Specification
• Equivalence is specified by
  – (Port, Throughput, Latency, Condition)

behavior Adder1(in int in1, in int in2, in int in3, out int out1) {
    void main() {
        out1 = in1 + in2 + in3;
    }
};

(in1, 1, 0, TRUE)
(in2, 1, 0, TRUE)
(in3, 1, 0, TRUE)
(out1, 1, 0, TRUE)

behavior Adder2(in int in1, in int in2, in int in3, out int out1) {
    void main() {
        int tmp;
        while (1) {
            tmp = in1 + in2;
            waitfor(5);
            tmp = tmp + in3;
            waitfor(5);
            out1 = tmp;
        }
    }
};

behavior Adder3(in int in1, out int out1) {
    void main() {
        int tmp;
        while (1) {
            tmp = in1;
            waitfor(1);
            tmp = tmp + in1;
            waitfor(1);
            tmp = tmp + in1;
            waitfor(1);
            out1 = tmp;
            waitfor(1);
        }
    }
};

Sequentialization
• Concurrent behaviors are sequentialized
• If st1 and st2 running concurrently are in a "write-write" or "read-write" relation, check the following properties:
  – P1: always T(st1) > T(st2)
  – P2: always T(st1) < T(st2)
  – T(s) … execution time of a statement s
• The checks are based on ILP
  – Is there any assignment satisfying P1 (P2)?
  – With the timing constraints generated from SpecC designs

(P1, P2) = (pass, pass): impossible
(P1, P2) = (fail, pass): always st1 → st2 (can be sequentialized)
(P1, P2) = (pass, fail): always st2 → st1 (can be sequentialized)
(P1, P2) = (fail, fail): order is undecidable
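The decision table can be illustrated with a much simpler timing model than ILP. The sketch below assumes, purely for illustration, that each statement's execution time is only known to lie in an independent interval (earliest, latest); the function name `order` and this interval model are not from the original flow.

```python
def order(t1, t2):
    """Decide the execution order of st1 and st2 from (earliest, latest) time bounds.

    Returns "st1 -> st2", "st2 -> st1", or None when the order is undecidable.
    """
    p1 = t1[0] > t2[1]   # P1: T(st1) > T(st2) holds for every admissible timing
    p2 = t1[1] < t2[0]   # P2: T(st1) < T(st2) holds for every admissible timing
    if p1:               # P1 and P2 can never hold together
        return "st2 -> st1"
    if p2:
        return "st1 -> st2"
    return None          # overlapping time windows: cannot be sequentialized


print(order((0, 3), (5, 9)))
print(order((5, 9), (0, 3)))
print(order((0, 6), (5, 9)))
```

With real designs the time bounds of different statements are coupled through waitfor and wait/notify constraints, which is why a constraint solver is needed instead of plain interval comparison.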

Sequentialization: examples

No dependence (no check is needed)
  Process 1: a = 10; b = 10; c = a + b;
  Process 2: x = 20; y = 20; z = x + y;
  Sequentialized: a = 10; b = 10; c = a + b; x = 20; y = 20; z = x + y;

Synchronized
  Process 1: a = 10; wait e; c = a + x;
  Process 2: x = 20; notify e; y = 20; z = x + y;
  Check: always x=20 → c=a+x? Result: YES
  Check: always x=a+x → x=20? Result: NO
  Can be sequentialized!!
  Sequentialized: a = 10; x = 20; y = 20; z = x + y; c = a + x;

Not synchronized
  Process 1: x = 10; a = 10; c = a + x;
  Process 2: x = 20; y = 20; z = x + y;
  Check: always x=10 → x=20? Result: NO
  Check: always x=20 → x=10? Result: NO
  Cannot be sequentialized!!

For HW/SW co-design

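[Figure: flow for HW/SW co-design. The software part (C) and the hardware part (RTL) are each transformed into an FSMD, giving SW (FSMD) and HW (FSMD). After abstraction on communication, identification of synchronization points, sequentialization, and reduction of the number of states, a combined HW+SW (FSMD) is obtained, to which model checking and equivalence checking are applied.]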

Case Study 1: MPEG4
• Difference between designs
  – Constant propagation, constant folding, and common sub-expression elimination in the IDCT function
• Design size
  – About 6300 lines in SpecC
  – About 50k nodes and 36k edges in ExSDG
• ExSDG generation time: 780 sec

Designs compared         | Nodes in diff | Result  | Run time | # of ext.
MPEG4_org vs. MPEG4_rev1 | 96            | Eqv     | 3.3 sec  | 0
MPEG4_org vs. MPEG4_rev2 | 96            | Not eqv | 13.2 sec | 80

Case Study 2: Elevator Controller
• Difference between designs
  – Speculative code motion in control paths
• Design size
  – About 3300 lines in SpecC
  – About 20k nodes and 20k edges in ExSDG
• ExSDG generation time: 178 sec

Designs compared     | Nodes in diff | Result | Run time   | # of ext.
Elv_org vs. Elv_rev1 | 4             | Eqv    | 1.8 sec    | 1
Elv_org vs. Elv_rev2 | 3             | ----   | > 12 hours | 4

Rule-based Equivalence Checking

Checks the equivalence between two high-level (e.g. SpecC) design descriptions

• Assuming the equivalences of variables, equivalence rules are applied in a bottom-up manner
  – Equivalence rules are defined in terms of static dependence relations and control flows
• The verification result is either "equivalent" or "cannot prove the equivalence"
  – It cannot prove that the designs are not equivalent
  – Equivalence rules are chosen heuristically

Equivalence Rules (1/3)
Rule 1: Expression
• Checks the equivalence considering the commutative, associative, and distributive laws

The two designs (original and modified) compute d * (b * a + c) and d * a * b + c * d.

[Figure: expression trees of the two designs, matched by applying the commutative law (b * a versus a * b) and the distributive law.]

Equivalence Rules (2/3)
Rule 2: Assignment
• The variable on the LHS is equivalent to the RHS until the variable is re-assigned

Original design:
{ return a - b; }

Introducing an intermediate variable:
{ int c = a - b; return c; }

[Figure: syntax trees of the two designs. The subtraction a - b is matched by Rule 1; the assignment to c and its use in return are matched by Rule 2.]

Equivalence Rules (3/3)
Rule 3: Sequential composition
• The execution order can be changed unless the change destroys the data dependence relations

Original design:
{ L1: c = a + b; L2: d = a + c; L3: e = b + c; }

Swapping L1 and L2:
{ L2: d = a + c; L1: c = a + b; L3: e = b + c; }

Swapping L2 and L3:
{ L1: c = a + b; L3: e = b + c; L2: d = a + c; }

[Figure: seq nodes with children (L1, L2, L3), (L2, L1, L3), and (L1, L3, L2).]
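Whether a swap preserves the data dependence relations can be checked from the variables each statement writes and reads. The sketch below is a minimal version of such a check for single-assignment statements; the encoding and the name `can_swap` are assumptions, not the rule as implemented in the checker.

```python
def can_swap(s1, s2):
    """True if two adjacent statements may be reordered without breaking a dependence.

    Each statement is (written variable, set of read variables).
    """
    (w1, r1), (w2, r2) = s1, s2
    no_flow = w1 not in r2      # s2 does not read what s1 writes
    no_anti = w2 not in r1      # s1 does not read what s2 writes
    no_output = w1 != w2        # the two statements write different variables
    return no_flow and no_anti and no_output


L1 = ("c", {"a", "b"})   # c = a + b
L2 = ("d", {"a", "c"})   # d = a + c
L3 = ("e", {"b", "c"})   # e = b + c

print(can_swap(L1, L2))
print(can_swap(L2, L3))
```

Under this check, swapping L2 and L3 is allowed because neither statement uses the other's result, whereas swapping L1 and L2 is rejected because L2 reads the value of c written by L1.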

Example: Bottom-Up Application of Rules

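Original design:
{ c = a - b; f = d + e; }

Modified design:
{ f = e + d; c = a - b; }

[Figure: syntax trees of both designs. Internal equivalences are established bottom-up: Rule 1 matches the expressions (a - b; d + e with e + d), Rule 2 matches the assignments to c and f, and Rule 3 matches the reordered seq nodes.]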

A Known Issue of Rule-based Checking

How can we find internal equivalences?
• Our initial method finds them by "name"
  – That is, all variables with the same name are identified as equivalent
• This approach fails the equivalence check when variable names change
  – Typically, variable names are changed through design transformations
  – If variables in different places have the same name, the result may be a false positive

Examples of Checking Failure
Though the following examples are all equivalent, a name-based equivalence checker fails.

Original design:
int ex1(int a, int b) { return a - b; }

Swapping the variable names:
int ex2(int b, int a) { return b - a; }

Modifying the variable names:
int ex3(int c, int d) { return c - d; }

Introducing an intermediate variable:
int ex4(int a, int b) { int c = a - b; return c; }

Identifying Potential Internal Equivalences
• Perform a random simulation, then identify sets of variables having the same signature (i.e. sequence of simulated values)
  – Well-known technique in RTL verification
    • In RTL, the values of registers are uniquely known at every cycle
  – However, there is no concept of "cycle" in behavioral-level design descriptions
    • The concept of "a variable with context" is introduced

Design 1:
e = a - b;
f = a + b;

Design 2:
x = b + a;
y = a - b;

Random pattern:
a = (35, -4, 712)
b = (-220, 1151, -3)

Signatures (from simulation):
e = (255, -1155, 715)
f = (-185, 1147, 709)
x = (-185, 1147, 709)
y = (255, -1155, 715)

Potential internal equivalences:
{e, y}
{f, x}
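Grouping variables by signature is easy to reproduce for the two small designs. In the sketch below the designs are modeled as hypothetical Python functions returning their internal variables, and the input patterns are the three values listed above rather than freshly drawn random numbers.

```python
from collections import defaultdict


def design1(a, b):
    return {"e": a - b, "f": a + b}


def design2(a, b):
    return {"x": b + a, "y": a - b}


def potential_equivalences(designs, patterns):
    """Group variables of all designs whose simulated value sequences coincide."""
    by_signature = defaultdict(list)
    for design in designs:
        runs = [design(a, b) for a, b in patterns]
        for var in runs[0]:
            signature = tuple(run[var] for run in runs)
            by_signature[signature].append(var)
    return sorted(sorted(group) for group in by_signature.values())


patterns = [(35, -220), (-4, 1151), (712, -3)]   # (a, b) pairs from the slide
groups = potential_equivalences([design1, design2], patterns)
print(groups)
```

Equal signatures only make variables candidates: two non-equivalent variables can agree on a handful of patterns, so the grouping has to be confirmed by the equivalence rules.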

Internal Equivalences at Behavioral Level

Definition of a cycle
• A period of the execution from accepting an input pattern to generating a set of output values
• Internal variables must be assigned once in a cycle
  – Designs are in static single assignment (SSA) form
  – The concept of "a variable with context" is introduced for multiple instantiations of modules and function calls

A variable with context
• Context: runtime path information from the top-level module to the current function
• Guaranteed to be assigned only once in a cycle

Method Based on Potential Internal Equivalences

Issue: Potential internal equivalences may include false equivalences
• False equivalences may lead to false positive results
• Non-equivalent variables may be identified as equivalent
  – Equivalent variables are always identified as equivalent
  – True equivalences are included in the potential equivalences

Solution: Explore all possible subsets of internal equivalences when applying the rules
• Still practical, since each set of internal equivalences is typically small

Implementation
Implemented in C++ on top of our system-level verification framework FLEC
  – Given a SpecC design description as input, the ExSDG representation is generated by the parser and dependence analyzer
• Potential internal equivalence identifier
  – Generates a SpecC description from the ExSDG representation, as well as a random input pattern generator module
  – Compiles the random simulator using the SpecC reference compiler
• Rule-based equivalence checker
  – Each rule is implemented as a callback function
  – Given two nodes in ExSDGs, the equivalence between the two sub-trees is checked

Case Study: A Practical Design

Example: Two IDCT designs before and after parallelization
• Column and row processes are parallelized
• The existing method failed the check since it could not find the correspondences between variables
• The proposed method checked the equivalence
  – 10 cycles of random simulation
  – Runtime: ~3 seconds
    • Mainly compilation and execution time of the simulator








Debug Hardware
• Event Extractor
  – Extracts the basic required information
    • Transaction type (read/write)
    • Start of transaction
    • End of transaction
    • Initial and target address of transaction
• Trace Buffer
  – Stores extracted transactions in a compact form

[Figure: a transaction-level exchange between initiator, channel, and target (put(req), get(req), put(res), get(res), each returning true/false) is mapped onto the signal-level exchange between master, bus, and slave (m_request, m_select, opb_grant, opb_dbus, s_dbus, opb_xferack, s_xferack).]

Post Silicon Debug with ExSDG
• Focus on the communication parts of designs
• Establish a mapping between ExSDG (for C) and chip traces

Debug flow:
1. Extract basic transactions: transaction type (read/write), start and end of transaction, initial and target of transaction
2. Store the extracted transactions in a trace buffer
   (run the system until a failure)
3. Read the trace buffer
4. Analyze the traces with software: find wrong behaviors using debug patterns (potential race, potential deadlock)

[Figure: the initial system (Module 1 … Module k connected by a channel such as an on-chip bus or network) is extended with debug HW (transaction extractor and trace buffer) and debug SW that reads out the trace buffer.]

Post Silicon Debug: some results
• Hardware overhead is low
  – OPB bus: 1966 gates
  – PLB bus: 2206 gates
• Trace buffer is not large
  – Trace buffer fields: 1 or 2 bytes per transaction
    Master ID | Slave ID | Address | R/W | Command | Tag
• Debug patterns are written as assertions

assert never
SoTr(m1, s1, Wr, -, t1) ; SoTr(m2, s1, Wr, SAME, t2) ; EoTr(m1, s1, Wr, SAME, t1)
| EoTr(m1, s1, Wr, -, -) ; SoTr(m2, s1, Wr, SAME, -)
filter (*,*,*)

• Analysis is quick: 20 seconds for 100,000 transactions
  – Working with ExSDG (C descriptions)

Example scenario:
1) Master1 locks the first semaphore
2) Master2 locks the second semaphore
3) Master1 waits for the second semaphore
4) Master2 waits for the first semaphore
5) Steps 3 and 4 are repeated

Our approach to debugging high-level designs
• Concrete simulation
  – Depth-first search (DFS): long range, narrow width
• Formal method: symbolic state traversal
  – Breadth-first search (BFS): short range, wide width
• Our approach: user-driven search that combines DFS and BFS
  – DFS: to collect reachable states
  – BFS: to search exhaustively around states of interest (user specified)
  – The user switches between DFS and BFS using various commands, including execution path specification

[Figure: state-space diagrams. Starting from the initial states, DFS alone covers a narrow path through the reachable states, and BFS alone covers a wide but shallow region. In the combined approach, DFS steps and jumps lead to states of interest and BFS searches around them until faulty states (F) are found. Infeasible states lie outside the reachable states.]

Some experiences
• Filter design from a company
  – 170 LoC in SpecC, part of a buggy real design
  – BMC cannot detect the bug (simply too deep)
  – The designer specifies a set of execution paths that he is concerned about
    • Concern: some portions of the code may not work in particular sequences
  – A 61-cycle pattern was successfully generated with the proposed approach
• Elevator controller
  – 9500 LoC in SpecC
  – Target assertion
    • The door must open within 30 cycles after the up/down button is pushed while the elevator is stopped on a floor
  – BMC failed at the 120th cycle (after a run of more than 10 days)
  – Judging from BMC alone, there appears to be no such issue
  – User-guided analysis reveals the failure!








Data-path Dominated Applications

[Figure: design flow for data-path dominated applications. A real-number specification (floating-point model) goes through automatic fixed-point generation to a MATLAB model (fixed point). Refinement and optimization, with polynomial optimization on the HED representation, and RTL synthesis lead to an RTL model (fixed bit-width), a gate-level model (bit level), and physical design. Equivalence checking and debugging use the Modular-HED representation at the arithmetic level and at the bit level.]

Horner Expansion Diagram (HED)
• Horner expansion: $f(x, y, \ldots) = f_{const} + x \cdot f_{linear}$
  – Const (left) child (dashed line)
  – Linear (right) child (solid line)
• Example
  – $f(x,y,z) = x^2 y + xz - 4z + 2$; order: $x > y > z$
  – $f(x,y,z) = [-4z + 2] + x[xy + z] = f_{const} + x \cdot f_{linear}$
    • $f_{const} = -4z + 2 = f(z)$
    • $f_{linear} = f_1(x,y,z) = xy + z$
  – $f_1(x,y,z) = xy + z = z + x[y]$
    • $f_{1,const} = z$; $f_{1,linear} = y$
  – Horner form
    • $f = x(x(y) + z) + z(-4) + 2$

[Figure: HED of f(x,y,z). The root node x has a const child for -4z + 2 and a linear child for xy + z; the lower nodes for x, y, and z end in the constants 1, 0, -4, and 2.]
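One step of the Horner expansion can be reproduced on a sparse polynomial representation. In this sketch a polynomial is a dictionary from exponent tuples (for x, y, z) to coefficients; the representation and the name `horner_split` are illustrative choices, not the HED data structure itself.

```python
def horner_split(poly, var):
    """Split poly into (f_const, f_linear) with poly = f_const + v * f_linear,
    where v is the variable at index `var` of the exponent tuples."""
    const, linear = {}, {}
    for exps, coeff in poly.items():
        if exps[var] == 0:
            const[exps] = coeff            # term without v goes to the const child
        else:
            lowered = list(exps)
            lowered[var] -= 1              # factor one v out of the term
            linear[tuple(lowered)] = coeff
    return const, linear


# f(x, y, z) = x^2*y + x*z - 4*z + 2, exponents ordered as (x, y, z)
f = {(2, 1, 0): 1, (1, 0, 1): 1, (0, 0, 1): -4, (0, 0, 0): 2}
f_const, f_linear = horner_split(f, 0)            # -4z + 2 and x*y + z
f1_const, f1_linear = horner_split(f_linear, 0)   # z and y
print(f_const, f_linear, f1_const, f1_linear)
```

Applying the split recursively, first on x and then on y and z in the chosen order, yields the const/linear children that form the diagram.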

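High-level Polynomial Datapath Verification
Modular equivalence checking
  – Anti-aliasing function: $F = \frac{1}{2\sqrt{a^2 + b^2}} = \frac{1}{2\sqrt{x}}$, with $x = a^2 + b^2$
  – Expanded into a Taylor series:
    $F \approx \frac{1}{64}x^6 - \frac{9}{32}x^5 + \frac{115}{64}x^4 - \frac{75}{16}x^3 + \frac{279}{64}x^2 - \frac{81}{32}x + \frac{85}{64}$
  – Implemented as a fixed-size datapath
    • F1[15:0], F2[15:0], x[15:0]

[Figure: datapath. Inputs a and b produce $x = a^2 + b^2$; a MAC unit with a register and stored coefficients computes F.]

• $F_1 = 156x^6 + 62724x^5 + 17968x^4 + 18661x^3 + 43593x^2 + 40224x + 13281$

• $F_2 = 156x^6 + 5380x^5 + 1584x^4 + 10469x^3 + 27209x^2 + 7456x + 13281$

• $F_1 \neq F_2$ over $Z$
• $F_1[15:0] = F_2[15:0] \bmod 2^{16}$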

Modular-HED (M-HED)
• The Smarandache function S(b) in number theory is defined for a given positive integer b as the smallest positive integer such that its factorial is divisible by b: $b \mid S(b)!$
  – Example: the number 8 does not divide 1!, 2!, or 3!, but does divide 4! (4!/8 = 3), so S(8) = 4
    • 1*2*3*4 % 8 = 0
    • 5*6*7*8 % 8 = 0
    • 100*101*102*103 % 8 = 0
  – A product of 4 consecutive numbers is divisible by 8
    • $x(x+1)(x+2)(x+3) \equiv 0 \bmod 8$
    • $x(x+1)(x+2)(x+3) = x^4 + 6x^3 + 11x^2 + 6x$ can be freely added or subtracted under mod 8!
• Coefficients are modified, transforming the given polynomials to normal forms

Modular-HED (M-HED)
• Vanishing polynomial: $g(x) \equiv 0 \bmod n$
• If we can factorize a polynomial g(x) into a product of S(n) consecutive numbers, $\prod_{i=1}^{S(n)} (x + i)$, then it can be reduced to 0 over $Z_n$
  – $S(2^4) = 6$
• $f(x) = (x+1)(x+2)(x+3)(x+4)(x+5)(x+6) \equiv 0 \bmod 16$
  – $f(x) = x^6 + 21x^5 + 175x^4 + 735x^3 + 1624x^2 + 1764x + 720$ over $Z_{2^4}$
  – It can be reduced to ZERO!
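Both the Smarandache function and the vanishing property are small enough to check by brute force. The sketch below is an illustration only: the helper names are hypothetical, and checking x over 0 … n−1 is sufficient because polynomial values modulo n repeat with period n.

```python
from math import prod


def smarandache(b):
    """Smallest k such that k! is divisible by b."""
    k, factorial = 1, 1
    while factorial % b:
        k += 1
        factorial *= k
    return k


def vanishes(n):
    """Check that (x+1)(x+2)...(x+S(n)) is 0 modulo n for every x."""
    k = smarandache(n)
    return all(prod(x + i for i in range(1, k + 1)) % n == 0 for x in range(n))


print(smarandache(8), smarandache(16))
print(vanishes(8), vanishes(16))
```

Since such products evaluate to zero for every input, any multiple of them can be added to or subtracted from a polynomial over Z_n without changing the function it computes, which is what the coefficient normalization exploits.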

Preliminary Experimental Results

Benchmark | Specs (Var/Deg/n) | Modular-HED (Node/Time) | Method [9] (Time (s)) | CUDD-BDD (Node/Time) | *BMD (Node/Time) | miniSat [13] (Vars/Clauses/Time) | MILP [12] (Time (s))
AAF  | 1 / 6 / 16 | 8 / 0.016  | 6.81  | 1.1M / 32.2 | NA / >500  | 3.9K / 107K / >500 | >500
D4F  | 1 / 4 / 16 | 6 / 0.031  | 4.95  | 27M / 20.3  | NA / >1000 | 25K / 76K / >1000  | >1000
CHEB | 1 / 5 / 16 | 7 / 0.01   | 5.95  | 1M / 26.9   | NA / >500  | 3.5K / 86K / >500  | >500
PSK  | 2 / 4 / 16 | 16 / 0.032 | 13.48 | NA / >500   | NA / >500  | 52K / 142K / >500  | >500
DIRU | 2 / 4 / 16 | 9 / 0.016  | 14.4  | NA / >1000  | NA / >1000 | 10K / 30K / >1000  | >1000
MI   | 2 / 9 / 16 | 26 / 0.2   | 17.5  | 23M / 39.4  | NA / >1000 | 24K / 69K / >1000  | >1000
SG   | 5 / 3 / 16 | 35 / 0.24  | 6.1   | NA / >1000  | NA / >1000 | 64K / 190K / >1000 | >1000
QS   | 7 / 4 / 16 | 19 / 0.09  | 32.4  | NA / >1000  | NA / >1000 | 76K / 211K / >1000 | >1000

NA: Not Applicable; K: Thousand; M: Million

Boolean reasoning methods never work. We need to use word-level techniques such as the proposed one.

High-level Polynomial Datapath Optimization
A partitioning and compensation heuristic
• Given a polynomial, how can we optimize it in terms of the number of adders and multipliers at a fixed bit-width?
• Our solution
  – Apply Modular-HED
    • To reduce over $Z_{2^n}$
  – Partitioning approach
    • poly = p1*p2 + p3
    • Minimize p3
  – Compensation approach
    • Compute coefficients

[Figure: flow Modular-HED → Partitioning → Compensation.]

Partitioning Heuristic

• poly = p1*p2 + p3 with unknown coefficients

• Minimize the cost of p3

• In our example
  $poly = w^4 - x^3 - x^2 y - 5x^2 + 2xy + 2y^2 + xz + yz + 14y + 5z + 3$

• After partitioning
  $p_1 = a_1 x^2 + a_2 y + a_3 z$
  $p_2 = b_1 x + b_2 y + b_3$
  $p_3 = w^4 + 3$

[Figure: the partitioning step takes poly and produces p1, p2, p3 with poly = p1*p2 + p3.]

• Set the coefficients $(a_i, b_j)$ in order to achieve the minimum-cost p3

• First consider all the equations produced by p1*p2 = poly - p3
  – $a_1 b_1 = -1$, $a_1 b_2 = -1$, $a_1 b_3 = -5$, $a_2 b_1 = 2$, $a_2 b_2 = 2$, $a_3 b_1 = 1$, $a_3 b_2 = 1$, $a_2 b_3 = 14$, $a_3 b_3 = 5$
  – These equations may not have a solution!
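The effect of an inexact coefficient choice can be seen by multiplying out a candidate pair and looking at what is left over. The sketch below uses one assumed choice, a = (−1, 2, 1) and b = (1, 1, 5), which satisfies every equation above except a2·b3 = 14; this choice and the helper functions are illustrative and are not the result reported by the heuristic.

```python
from collections import defaultdict


def mul(p, q):
    """Product of two sparse polynomials (exponent tuple -> coefficient)."""
    out = defaultdict(int)
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            out[tuple(i + j for i, j in zip(e1, e2))] += c1 * c2
    return dict(out)


def sub(p, q):
    """Difference p - q, dropping terms whose coefficient becomes zero."""
    out = defaultdict(int, p)
    for e, c in q.items():
        out[e] -= c
    return {e: c for e, c in out.items() if c}


# Exponent tuples are ordered as (w, x, y, z).
poly = {(4, 0, 0, 0): 1, (0, 3, 0, 0): -1, (0, 2, 1, 0): -1, (0, 2, 0, 0): -5,
        (0, 1, 1, 0): 2, (0, 0, 2, 0): 2, (0, 1, 0, 1): 1, (0, 0, 1, 1): 1,
        (0, 0, 1, 0): 14, (0, 0, 0, 1): 5, (0, 0, 0, 0): 3}
p1 = {(0, 2, 0, 0): -1, (0, 0, 1, 0): 2, (0, 0, 0, 1): 1}   # -x^2 + 2y + z
p2 = {(0, 1, 0, 0): 1, (0, 0, 1, 0): 1, (0, 0, 0, 0): 5}    # x + y + 5

p3 = sub(poly, mul(p1, p2))   # remainder that p3 must absorb
print(p3)
```

With this choice the remainder is w^4 + 4y + 3: the single unsatisfied equation shows up as the extra term 4y, which is the kind of mismatch a compensation step has to account for.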

Compensation Heuristic

Preliminary Experimental Results

Application      | Function         | M/V/D/n   | Horner (#Gate / Delay / Time) | Enhanced CSE (#Gate / Delay / Time) | Our Approach (#Gate / Delay / Time)
Graphics         | Cosine-Wavelet   | 9/2/3/16  | 7850 / 23.7 / 0.04 | 5109 / 18.3 / 3.47 | 3678 / 18.5 / 6.52
Image Processing | Savitzky-Golay 1 | 10/2/3/16 | 7218 / 20.7 / 0.04 | 3757 / 22.7 / 4.75 | 2879 / 17.8 / 14.9
Image Processing | Savitzky-Golay 2 | 5/2/2/16  | 1697 / 20.8 / 0.03 | 1433 / 18.1 / 0.48 | 1057 / 16.2 / 1.63
Filter           | Quad 1           | 5/2/2/16  | 2737 / 17.7 / 0.03 | 2269 / 16.4 / 0.63 | 1763 / 17.1 / 1.78
Filter           | Quad 2           | 5/2/2/16  | 2569 / 16.4 / 0.03 | 2032 / 16.4 / 0.63 | 1571 / 15.3 / 1.56
Automotive       | Mibench          | 9/3/2/16  | 2058 / 15.7 / 0.04 | 2046 / 15.6 / 0.46 | 1303 / 14.1 / 8.75
Average saving w.r.t. Horner |      |           | 0% / 0% / -        | 31% / 6.3% / -     | 49.2% / 13.8% / -

M: the number of monomials; V: the number of variables; D: the maximum degree; n: the number of bits (word length)

Intensive optimization on polynomials greatly reduces the size of the generated custom circuits/software.

Bit Level Adder (BLA) Model
• In an arithmetic circuit, several addition processes are possible
• However, each addition can be represented with the BLA model
• Represent $A + 2^m B$ by custom full-adders and half-adders, which are represented by two cascaded XORs and one XOR, respectively
  – Realization of the carry signals is not uniquely modeled
• BLA is robust for
  – Carry-look-ahead adders
  – Ripple-carry adders
  – Carry-select adders
  – Carry-skip adders

The Proposed Debugging Algorithm
• Partial product initialization
  – Extract each column of $2^i$ from S
  – Provide a bit-level ADD_SET with different bit-orders
  – Bits in each column must be added with each other, while the generated carries are sent to the higher-order column
• Column-based XOR extraction
  – Search XORs over the initial partial product terms of each ADD_SET column
  – At least one input from column $2^i$
  – XOR extraction is performed without carry-logic blocks
    • Much faster than other XOR extraction techniques

[Figure: XOR extraction over the partial products P1(2), P2(2), P3(2) and an unknown input.]

The Proposed Debugging Algorithm
• Carry-signal mapping using HED
  – For each FA with no unknown input, build a new reference (in HED)
  – For each HA/FA with an unknown input Cm
    • Find the backward logic of Cm
    • Map Cm to the HED of the most similar reserved carry from the previous column

[Figure: a full adder (FA) in a 4-bit multiplier.]

Experimental Setup and Results

• F1 = A*B; F2 = C*D+3E+120; F3 = A*B+C*D
  – A, B: 32-bit; C, D, E, F1: 64-bit; F2, F3: 129-bit

Can deal with large (64-, 128-bit) arithmetic functions for efficient verification and debugging








Communication parts can be very complicated

• Need an interface between protocol A and protocol B
• Protocols can be very complicated
  – Over 30 different commands are defined in state-of-the-art protocols (OCP)
  – Manuals of over 200 pages
  – Burst, out-of-order modes, …
• We have developed an automatic generator of protocol transducers
  – Protocols are formally defined with FSMs/automata
  – State-of-the-art protocols can be dealt with in a couple of minutes
  – Now being extended to:
    • On/off-chip bus interconnections
    • Formal verification of interfaces

[Figure: an SoC with IP blocks (CPU, DMAC, RAM) and other blocks (MPEG, custom HW, DMAC, RAM) on buses using protocol A and protocol B, connected through a transducer.]

How protocol transducer is realized

• Intuitive understanding of the problem
  – Follow the two protocols ⇒ compute the product of the two FSMs/automata

[Figure: a protocol-A master and a protocol-B slave exchange requests and responses through the protocol transducer (the design target). The definitions of protocol A and protocol B are given as FSMs/automata with clock-wise behavior (e.g. states ack<=0, (stb==1) ack<=1, ack<=0). The protocol transducer, also an FSM/automaton, is obtained by exploration [1] plus our extensions.]

[1] R. Passerone, J. A. Rowson, A. Sangiovanni-Vincentelli, "Automatic Transducer Synthesis of Interfaces between Incompatible Protocols," DAC '98, pp. 8-13.

Simple computation of products

• Protocol definition automaton should not have any loops
  – Even in the same state, data values are different
  – Expansion can be infinite
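For loop-free automata the product can be enumerated by a plain graph search. The sketch below assumes that in each step either side, or both sides, may advance; the two three-state chains stand in for the protocol automata, and all names are hypothetical. The data-dependency checks that make some product paths invalid are left out.

```python
from collections import deque


def product(a, b, start):
    """Enumerate the product of two loop-free automata.

    a and b map each state to its list of successor states. The result maps
    each reachable pair (s, t) to its successor pairs.
    """
    edges, seen, todo = {}, {start}, deque([start])
    while todo:
        s, t = todo.popleft()
        successors = ([(s2, t) for s2 in a[s]]                    # only A advances
                      + [(s, t2) for t2 in b[t]]                  # only B advances
                      + [(s2, t2) for s2 in a[s] for t2 in b[t]]) # both advance
        edges[(s, t)] = successors
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return edges


proto_a = {"A": ["B"], "B": ["C"], "C": []}
proto_b = {"D": ["E"], "E": ["F"], "F": []}
edges = product(proto_a, proto_b, ("A", "D"))
print(len(edges), edges[("A", "D")])
```

The search terminates only because neither automaton has a loop. With loops, and with data values distinguishing visits to the same state, the expansion would not be finite, which is the problem stated above.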

[Figure: product of the automaton A → B → C (protocol A, master side) and the automaton D → E → F (protocol B, slave side), expanded from state (A, D) through (A, E), (B, D), (B, E), … to (C, F). Both sides have an 8-bit data port and a ctrl signal, and the transitions carry values such as {Ctrl=0}, {Ctrl=1, data1}, {Ctrl=0, data2}. Some product paths are invalid due to illegal dependency, some paths are avoidable, and some are unavoidable; the minimum-latency path is marked.]

Need separation between comp. and comm.!

Separation of computation and communication inside protocol transducers

• In the protocol definition, control and data are specified separately

• Introduce two FSMs for request and control to describe complicated protocols uniformly

• The FIFO can be made arbitrarily complicated if we like

[Figure: the transducer between the protocol-A master and the protocol-B slave is built from a request FSM (Req. FSM) on the request path and a response FSM (Res. FSM) on the response path. Annotation: even arithmetic computation is possible.]

Protocols can be very complicated
• State-of-the-art protocols introduce many features for higher throughput

[Figure: timing diagrams between protocol master and protocol slave (request: address/data; response: data).
  – Blocking (low throughput): Req1 → Res1, then Req2 → Res2
  – Split transaction (non-blocking): Req1, Req2, Req3 are issued while Res1, Res2 are still returning
  – Out-of-order transaction: responses return in a different order (Res1, Res3, Res2)
  – Burst transaction: Addr1 to Addr4 with Data1 to Data4
  – Single-address burst transaction: Addr1 with Data1 to Data4]

For more complicated protocols…

[Figure: the input automata for protocol A and protocol B each have a request (Req.) part and a response (Res.) part. The transducer between the protocol-A master and the protocol-B slave contains a Req. FSM, a Send FSM, and a Recv FSM, with newly introduced FIFOs (FIFO write on the request side, FIFO read on the response side) and read/write control for the FIFOs.]

Pros: can deal with more complicated protocols

Cons: more latency due to multiple FIFOs

Now we can resolve it
• Elimination of loops (to initial states)
• Elimination of intermediate loops
• Concentrating on controls only
  – Data parts are processed separately! [2]

[Figure: loops back to the initial state i are removed by introducing an ending state e; exploration is performed; the ending states are then eliminated. Intermediate loops are replaced with super states (SS) before exploration.]

[2] S. Watanabe, K. Seto, Y. Ishikawa, S. Komatsu, M. Fujita, "Protocol Transducer Synthesis using Divide and Conquer approach," Proc. of the 12th Asia and South Pacific Design Automation Conference, pp. 280-285, 2007.

How to deal with multiple complicated transactions

• A protocol is a collection of sequences
• Each sequence can operate independently
  – True for state-of-the-art protocols with separation between computation and communication

[Figure: a protocol definition consists of a hardware definition (ports, signal names, etc.) and sequences, e.g. Sequence 1 (read), Sequence 2 (write), Sequence 3 (4-burst read), Sequence 4 (4-burst write). Each sequence has Automaton 1 (for request or blocking) and Automaton 2 (for response), e.g. i → (stb==1) ack<=0 → ack<=1 → ack<=0. All sequences share the initial state.] [2]

Hierarchical synthesis owing to comp. and comm. separation

[Figure: pairs of sequences of protocol A and protocol B (Sequence A1 with Sequence B1, Sequence A2 with Sequence B2) are each explored by graph search to produce partial transducers 1 and 2, which are merged into the transducer.]

• Merge the generated FSMs with the same initial state

• Sequence-level synthesis followed by a merge process [2]

Tool implementation
• Planned to be distributed freely from OCP-IP

Experimental results
• Athlon 64 2 GHz + 1 GB RAM
• Implemented as over 12,000 LoC in C++
  – Input: hierarchical automaton descriptions in XML
  – Output: RTL synthesizable Verilog
• Logic synthesis: Xilinx ISE
• RTL simulator: ModelSim XE

Master's protocol | Slave's protocol | Type       | Sequences | Synth. time | Gate count
OCP               | AHB              | (NB, BK)   | 4         | 1.1 s       | 2,352
AHB               | OCP              | (BK, NB)   | 4         | 1.3 s       | 1,843
OCP               | OCP              | (NB, NB)   | 2         | 1.9 s       | 1,568
OCP               | Tagged OCP       | (NB, OoO)  | 2         | 2.2 s       | 3,514
Tagged OCP        | AXI              | (OoO, OoO) | 2         | 4.8 s       | 1,377
AXI               | OCP              | (OoO, NB)  | 2         | 4.9 s       | 1,731
OCP               | AXI              | (NB, OoO)  | 26        | 257.8 s     | 61,205