A provably tight delay-driven concurrently congestion mitigating global routing algorithm

Applied Mathematics and Computation 255 (2015) 92–104


http://dx.doi.org/10.1016/j.amc.2014.11.062
0096-3003/© 2014 Elsevier Inc. All rights reserved.

The reported study was partially supported by RFBR, Russia, research project No. 13-07-00139, and by the Basic Science Research Program (NRF-2013R1A1A2064302) through NRF, Korea.
⇑ Corresponding author.

E-mail addresses: [email protected], [email protected] (R. Samanta), [email protected] (A.I. Erzin), [email protected] (S. Raha), [email protected] (Y.V. Shamardin), [email protected] (I.I. Takhonov), [email protected] (V.V. Zalyubovskiy).

Radhamanjari Samanta a,⇑, Adil I. Erzin b,c, Soumyendu Raha a, Yuriy V. Shamardin b, Ivan I. Takhonov c, Vyacheslav V. Zalyubovskiy b

a Supercomputer Education and Research Center, Indian Institute of Science, Bangalore, India
b Sobolev Institute of Mathematics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
c Novosibirsk State University, Novosibirsk, Russia

Article info

Keywords: Steiner tree; Elmore delay; Global routing; Gradient method

Abstract

Routing is a very important step in VLSI physical design. In multi-net global routing, a set of nets is routed under delay and resource constraints. In this paper, a delay-driven congestion-aware global routing algorithm is developed, which is a heuristic-based method for solving a multi-objective NP-hard optimization problem. The proposed delay-driven Steiner tree construction method has complexity $O(n^2 \log n)$, where $n$ is the number of terminal points, and it provides an $n$-approximation solution of the critical time minimization problem for a certain class of grid graphs. The existing timing-driven method (Hu and Sapatnekar, 2002) has complexity $O(n^4)$ and is implemented on nets with a small number of sinks. Next, we propose an FPTAS Gradient algorithm for minimizing the total overflow. This is a concurrent approach that considers all the nets simultaneously, in contrast to the existing approaches of sequential rip-up and reroute. The algorithms are implemented on ISPD98-derived benchmarks, and a drastic reduction of overflow is observed.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

The Global Routing Problem (GRP) in VLSI design is the problem of routing a set of nets (multi-net global routing) subject to limited resources and delay constraints. Various recent approaches for solving GRP are available [2,5,20,15,18,23], but these methods are not timing-driven. Most of these modern routers generate Steiner trees with highly optimized wirelength and then use rip-up and reroute iteratively to reduce the congestion. But merely optimizing the wirelength and then minimizing the overflow will not produce a feasible routing, because the resulting trees will not necessarily meet timing at the sinks. A simple example is shown in Fig. 1 to demonstrate that optimum wirelength does not necessarily mean optimum delay and vice versa.

The computation of delay is heavily dependent on pendant subtrees. Therefore, optimizing delay and congestion is a multi-objective problem, which is the focus of this work. In the example, Fig. 1(a) shows that, given 3 pins, 1 (source), 2 (sink), and 3 (sink), we first draw the Hanan grid by drawing horizontal and vertical lines through them. The intersecting points are the generated Steiner points. Fig. 1(b) shows the minimum wirelength (WL = 3 units) tree configuration, and Fig. 1(c) shows another tree configuration for the net, whose wirelength is WL = 4 units. Now, we compute the delay at the sink nodes for both trees. In the figure, all $R_i = R$ and all $C_i = C$. Although the tree in Fig. 1(b) has minimum WL (3 units), it has a delay of $3RC$, whereas the tree in Fig. 1(c) has 4 units of WL but a delay of $2RC$ at the sinks, as can be seen from the following equations. For Fig. 1(b),

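\[ \text{Delay at node 2:}\quad d_{12} = R_1 C_3 + (R_1 + R_2) C_2 \simeq 3RC, \]

\[ \text{Delay at node 3:}\quad d_{13} = R_1 C_2 + (R_1 + R_3) C_3 \simeq 3RC, \]

and for Fig. 1(c),

\[ \text{Delay at node 2:}\quad d_{12} = (R_1 + R_2) C_2 \simeq 2RC, \]

\[ \text{Delay at node 3:}\quad d_{13} = (R_4 + R_3) C_3 \simeq 2RC. \]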

As suggested by Moffitt et al. [21], there is increasing demand for timing-driven routing algorithms, and not many works have focused on this area. There are a few global routing algorithms [14,30,29] based on MVERT [13] which consider timing, but they have complexity $O(n^4)$ ($n$ is the number of sinks) and are implemented on nets with a small number of sinks. GRP has also been formulated as a multi-commodity flow problem [12]. With each net a certain flow of unit size is associated. Each edge has a flow capacity. Depending on the objective function, we may get a min-cost multi-commodity flow problem or a concurrent multi-commodity flow problem [27,3,1,10]. Meta-heuristics for solving GRP can be found in Timber-Wolf [8] (simulated annealing), [4] (evolution algorithm), [9] (genetic algorithm) and [31] (tabu search). A fault-tolerant routing method for network-on-chip is described in [17].

We propose an $O(n^2 \log n)$ method [26] for constructing delay-driven Steiner trees. Another contribution of our work is a Gradient-based approach for minimizing the overflow. The novelty of this algorithm is that it considers all the nets concurrently and provides a fully polynomial time approximation scheme (FPTAS).

In Section 2, the problem is formulated, and Section 3 describes our proposed algorithm MAD (Modified Algorithm of Dijkstra) and its iterative modifications (IMAD). IMAD is applied to create minimum critical delay Steiner trees for each net. History based IMAD, described in Section 5, is used to create congestion-aware trees for each net. We have also used the router FLUTE [7] to generate another set of candidate Steiner trees for each net. Finally, a gradient algorithm is used to pick one tree for each net from its candidate set of trees, such that the total congestion/overflow is minimal. The Gradient method is described in Section 4.

Since there is no recent timing-driven router available, we run MAD without Gradient on IBM/ISPD98 benchmarks, and this gives us the initial congestion of the chip. Then we run IMAD with Gradient and show how effectively it reduces the congestion. The benchmarks are modified by assigning resistance and capacitance values to the wires. We show that 66.4% of the trees picked by Gradient are generated by MAD and the rest are from FLUTE. The experimental results are shown in Section 7, and Section 8 gives the conclusion. The appendix gives the theoretical analysis of MAD.

Fig. 1. Minimum wirelength does not necessarily mean minimum delay and vice versa.

2. Problem formulation

The problem can be formulated as a two-criteria problem as follows. In the global graph, it is required to find a set of Steiner trees, each of which connects a subset of vertices, such that the total congestion overflow and the maximum timing slack are minimal. The approximate solution to this problem is constructed in two stages. First we construct a set of timing-driven Steiner trees for each net based on Elmore delay, and then we select one tree for each net, taking into account the density of connections. The density of connection is the number of wires passing through an edge. It is used as a measure of congestion.

At the first stage, we are given an undirected graph $G = (V, E)$, $|V| = m$. To each edge $(i,j) \in E$ two non-negative parameters $r_{ij}$ (resistance) and $c_{ij}$ (capacitance) are assigned. Consider the subset of vertices $S_v = \{0, 1, \dots, n\} \subseteq V$, where vertex 0 is the source of the graph (since the graph is undirected, it has no source, and we call vertex 0 the source of the signal) and the nodes $S_v \setminus \{0\}$ are the sinks or terminals. We call the nodes from $V \setminus S_v$ intermediate nodes. Each sink $i \in S_v$ has the capacitance $c_i$, and vertex 0 also has the resistance $r_0$.

Let us consider a Steiner tree $T$ spanning $S_v$ and rooted at 0. We use the Elmore delay metric. Let $k$ be an arbitrary terminal in $T$ and denote its Elmore delay by $t_k(T)$. Let us denote:

• $P_k(T)$: the path from 0 to $k$ in $T$. Also, $P_{uv}(T)$ denotes the path from $u$ to $v$ in $T$. The argument $T$ is omitted where it is obvious from context;
• $T_j$ ($j \in V(T)$) is the downstream subtree of $T$ with root $j$; $T_e$ ($e \in E(T)$) is the downstream subtree of $T$ rooted at the head node of arc $e$;
• $C(H)$ ($R(H)$): the total capacitance (resistance) of subgraph $H$; $C(H) = \sum_{e \in E(H)} c_e + \sum_{i \in V(H)} c_i$. We also use the notations $C_j = C(T_j)$, $C_e = C(T_e)$ and $R_{uv} = R(P_{uv})$.

The Elmore delay [22,25] along the arc $(i,j)$ in $T$ is defined as follows:

\[ d_{ij} = d_{ij}(T) = r_{ij}\left(\frac{c_{ij}}{2} + C_j\right). \tag{1} \]

The delay of signal propagation from source 0 to terminal $k$ (the delay along the path $P_k(T)$) is given by

\[ t_k = t_k(T) = r_0 C_0 + \sum_{(i,j) \in P_k(T)} d_{ij}. \tag{2} \]
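Formulas (1) and (2) can be evaluated in one pass over a rooted tree. The following is a minimal Python sketch of that evaluation, not the authors' implementation; the function name `elmore_delays` and the data layout (a child-to-parent map, a `wire` dictionary of per-arc resistance and capacitance, a `sink_cap` dictionary of node capacitances) are assumptions made for illustration only.

```python
def elmore_delays(parent, wire, sink_cap, r0):
    """Evaluate formulas (1) and (2) on a tree rooted at vertex 0.

    parent:   child -> parent map (hypothetical layout); the root 0 maps to None
    wire:     wire[(p, j)] = (r, c), resistance and capacitance of arc p -> j
    sink_cap: node -> capacitance c_i (missing nodes count as 0)
    r0:       resistance of the source vertex 0
    """
    children = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)

    def downstream_cap(j):
        # C_j: capacitance of node j plus all wires and nodes below it
        return sink_cap.get(j, 0.0) + sum(
            wire[(j, k)][1] + downstream_cap(k) for k in children[j])

    delay = {0: r0 * downstream_cap(0)}          # the r_0 * C_0 term of (2)
    stack = [0]
    while stack:
        i = stack.pop()
        for j in children[i]:
            r, c = wire[(i, j)]
            # formula (1): d_ij = r_ij * (c_ij / 2 + C_j), accumulated along the path
            delay[j] = delay[i] + r * (c / 2 + downstream_cap(j))
            stack.append(j)
    return delay


# A small tree: root 0, one Steiner node 1, two sinks 2 and 3, unit wires.
parent = {0: None, 1: 0, 2: 1, 3: 1}
wire = {(0, 1): (1.0, 1.0), (1, 2): (1.0, 1.0), (1, 3): (1.0, 1.0)}
delay = elmore_delays(parent, wire, sink_cap={}, r0=0.0)
critical = max(delay[k] for k in (2, 3))         # critical delay over the sinks
```

The recursive `downstream_cap` recomputes subtree capacitances on every call, which keeps the sketch short but is quadratic in the worst case; caching the values would be the obvious refinement.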

The maximum among all the delays to terminals in $T$ is called the critical delay and is denoted by $t^*(T)$. A terminal $k$ with $t_k(T) = t^*(T)$ is a critical terminal in tree $T$. In this paper, we discuss the subproblem of finding a tree spanning $S$ in $G$ that provides the minimum critical delay:

\[ \max_{i \in S} t_i(T) \to \min_T. \tag{3} \]

This problem is known to be NP-hard [16]. The following section describes the proposed approximate algorithm, the Modified Algorithm of Dijkstra (MAD), for solving problem (3).

3. Algorithm MAD

Steps of the algorithm:

Step 0. Set tree $T = (0, \emptyset)$ and delay $t_0 = 0$.

Step 1. Find

\[ (i,j) = \arg\min_{(u,v) \in E,\ u \in T,\ v \notin T} \{ t_u(T \cup \{u,v\}) + d_{uv} \}, \]

where

\[ t_u(T \cup \{u,v\}) = t_u(T) + r_0 (c_{uv} + c_v) + \sum_{e \in P_u(T)} r_e (c_{uv} + c_v). \]

Set $T = T \cup \{i,j\}$ and recalculate the delays $t_k$ ($k \in T$):

• if $j$ is an intermediate vertex, then set $t_j = t_i + d_{ij}$, and the delays $t_k$ at all the other vertices do not change;
• if $j$ is a terminal, then cut off all the pendant subtrees not containing terminal vertices and recalculate all the delays in $T$ by formula (2).

If not all terminals are included in $T$, then go to Step 1.
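The steps above can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the authors' code: it rescans every frontier edge in each iteration instead of using a heap, and it omits the Steiner-point clustering mentioned in the complexity discussion, so it does not reach the $O(n^2 \log n)$ bound. The names `mad` and `recompute_delays` and the dictionary-based graph layout are hypothetical.

```python
def recompute_delays(parent, adj, node_cap, r0):
    """Recalculate all delays in the tree by formula (2)."""
    children = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)

    def down_cap(j):  # C_j: node and wire capacitance downstream of j
        return node_cap.get(j, 0.0) + sum(
            adj[j][k][1] + down_cap(k) for k in children[j])

    t = {0: r0 * down_cap(0)}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in children[i]:
            r, c = adj[i][j]
            t[j] = t[i] + r * (c / 2 + down_cap(j))   # d_ij from formula (1)
            stack.append(j)
    return t


def mad(edges, node_cap, terminals, r0):
    """Naive MAD sketch. edges: {(u, v): (r, c)}, undirected; the root is vertex 0."""
    adj = {}
    for (u, v), rc in edges.items():
        adj.setdefault(u, {})[v] = rc
        adj.setdefault(v, {})[u] = rc
    parent, t = {0: None}, {0: 0.0}            # Step 0: trivial tree, t_0 = 0

    def path_res(u):                            # resistance of the tree path 0 -> u
        total = 0.0
        while parent[u] is not None:
            total += adj[parent[u]][u][0]
            u = parent[u]
        return total

    while not terminals.issubset(parent):
        best = None                             # Step 1: scan all frontier edges
        for u in parent:
            upstream = r0 + path_res(u)
            for v, (r, c) in adj.get(u, {}).items():
                if v in parent:
                    continue
                cv = node_cap.get(v, 0.0)
                # t_u(T + {u,v}) + d_uv, with d_uv = r_uv * (c_uv / 2 + c_v)
                key = t[u] + upstream * (c + cv) + r * (c / 2 + cv)
                if best is None or key < best[0]:
                    best = (key, u, v)
        if best is None:
            raise ValueError("some terminal is unreachable from the root")
        _, i, j = best
        parent[j] = i
        if j not in terminals:                  # intermediate vertex: local update
            r, c = adj[i][j]
            t[j] = t[i] + r * (c / 2 + node_cap.get(j, 0.0))
        else:                                   # terminal: prune, then use formula (2)
            while True:
                inner = set(parent.values())
                dead = [v for v in parent
                        if v != 0 and v not in terminals and v not in inner]
                if not dead:
                    break
                for v in dead:
                    del parent[v]
            t = recompute_delays(parent, adj, node_cap, r0)
    return parent, t
```

In this sketch a pruned vertex becomes available again in later iterations, which matches the demonstration in Section 3.1, where an edge is added a second time after being cut off.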


3.1. Demonstration of MAD with an example:

Here we show how MAD creates a rectilinear Steiner tree from a given set of points. In Fig. 2, the small circles 0, 1 and 2 are the given terminal points of a net, which are also called sinks. To generate Steiner points for rectilinear geometry, vertical and horizontal lines are drawn through nodes 0, 1 and 2, resulting in the grid structure in Fig. 2. The crosses at the intersections of the lines are the candidate Steiner points. We start from root node 0 with an initial trivial tree (the set of edges is empty). We assume the delay $t_0 = 0$. To keep the example brief and simple, we will not go into the details of the delay calculation using the formulas. Instead, we will assume the delay values at the nodes and generate the tree. Two edges, (0, 3) and (0, 5), go out from root 0. Let us assume $\{t_0(T \cup \{0,3\}) + d_{03}\} < \{t_0(T \cup \{0,5\}) + d_{05}\}$. Therefore, edge (0, 3) is selected by MAD. Since 3 is an intermediate (Steiner) vertex, only the delay at node 3 is updated, as $t_3 = t_0 + d_{03}$. Similarly, in the next two iterations, edges (0, 5) and (5, 6) are added to the tree and $t_5$ and $t_6$ are calculated. Next, the minimum delay edge picked by MAD is (5, 1). But the node to be added is 1, which is a sink. Therefore, all the pendant subtrees not containing terminal vertices are removed from the tree. The resulting tree in this iteration is shown in Fig. 3. The delays of the nodes in the tree have now changed, because the subtrees of some of the nodes have changed. Therefore, $t_0$, $t_5$ and $t_1$ are calculated using Eq. (2). Similarly, edges (1, 7), (5, 6) and (7, 2) are added in the next three consecutive iterations, and the delays of the added nodes, $t_7$ and $t_6$, are calculated. The last node added to the tree is sink 2. Therefore, after cutting off the redundant subtrees, the tree is as in Fig. 4. Since all the sinks have been added to the tree, the iteration stops.

Time complexity of algorithm MAD: The generated set of Steiner points is reduced to $O(cn)$, where $n$ is the number of sinks/terminal points and $c$ is a small factor. The reduction is based on a clustering technique adapted from [6]. The number of iterations of Step 1 is upper bounded by $n$, and the complexity of each Step 1 is $O(r \log r + g)$ based on a Fibonacci-heap implementation, where $r$, the total number of nodes in the graph, is $O(cn)$ and $g$, the number of edges, is a linear function of $n$. Therefore, the worst-case time complexity of MAD is $O(n^2 \log n)$.

The algorithm admits the following iterative modifications.

Algorithm IMAD-1 at iteration $k+1$ constructs a Steiner tree with MAD using the values of delays from the tree built at the previous iteration. At the 1st iteration, the previous tree is trivial. Let $T^k$ be the tree constructed by the algorithm at the $k$-th iteration and $d^k_{ij}$ be the delay along the arc $(i,j) \in T^k$. The algorithm uses MAD to construct $T^{k+1}$, and at each step of MAD the edge $(u,v)$ that minimizes the value $t_u(T^{k+1} \cup \{u,v\}) + d^k_{uv}$ ($(u,v) \in E$, $u \in T^{k+1}$, $v \notin T^{k+1}$) is attached to the tree.

Algorithm IMAD-2 at iteration $k+1$ constructs a Steiner tree with MAD using the values of delays from all the trees built at all the previous iterations. At the 1st iteration, the previous tree is trivial. Let $\bar d^k_{ij}$ be the arithmetic mean of the delays $d^l_{ij}$ along the arc $(i,j)$ in trees $T^l$ ($l \le k$). The algorithm uses MAD to construct $T^{k+1}$, and at each step of MAD the edge $(u,v)$ that minimizes the value $t_u(T^{k+1} \cup \{u,v\}) + \bar d^k_{uv}$ ($(u,v) \in E$, $u \in T^{k+1}$, $v \notin T^{k+1}$) is attached to the tree.

The algorithms stop when $T^{k+1} = T^l$ for some $l \le k$, or when the maximum number of iterations has been performed.

3.2. Preliminary lemmas

Here we introduce some notation and preliminary lemmas which will be used later.

Remark 1. Let $T$ be a Steiner tree spanning $S$, $u \in S$ be a leaf in $T$ and $P$ be the path connecting root 0 and $u$. $T \setminus P$ is a set of subtrees $\{T^{(0)}, \dots, T^{(k)}\}$. Denote by $v_i$ the root of $T^{(i)}$, $v_i \in P$. Vertices $\{v_1, \dots, v_k\}$ split $P$ into $k+1$ parts: $P = P_1 \cup P_2 \cup \dots \cup P_{k+1}$, where $P_1 = P_{0 v_1}$, $P_{k+1} = P_{v_k u}$ and $P_i = P_{v_{i-1} v_i}$ for $i = 1, \dots, k$ (see Fig. 5). We will use this notation in the following statements.

Fig. 2. Candidate Steiner points.

Fig. 3. Intermediate tree.

Fig. 4. Final Steiner tree generated by MAD.

Fig. 5. Remark 1.


Fig. 6. Lemma 2.

Fig. 7. Lemma 3.


Consider a Steiner tree $T$. Let $\hat t_u$ be the delay along the path $P_u$ (disregarding subtrees) and $\bar t_{uv}$ be the delay at vertex $v$ in the subtree $T_u \setminus T_v$.

Lemma 1. Let $T$ be a Steiner tree for $S$ and $u \in S$ be a leaf in $T$. Then

\[ t_u(T) = \hat t_u + r_0 \sum_{i=0}^{k} C(T^{(i)}) + \sum_{i=1}^{k} R(P_i) \sum_{j \ge i} C(T^{(j)}), \]

where $P_i$ and $T^{(j)}$ are defined as in Remark 1.

Proof. The proof is omitted due to page limitation. Details can be found in [28]. □

Lemma 2. Let $P$ be the path connecting the source vertex 0 with a terminal $u$, and let $v \in P$ (see Fig. 6). Then

\[ t_u(P) = \hat t_{0v} + \hat t_{vu} + R_{0v}(C_{vu} + c_u). \]

Lemma 3. Let $T$ be a Steiner tree for $S$, $u \in S$ be a leaf in $T$ and $v \in P_u$ (see Fig. 7). Then

\[ t_u(T) = \bar t_{0v} + \bar t_{vu} + R_{0v} C(T_v). \]


4. Congestion aware tree selection

Algorithm IMAD constructs a set of timing-driven Steiner trees for each net in the global graph. The algorithm is applied when the following inputs are provided: a logical network given as a set of nets, with primary inputs with arrival times (ATs) and primary outputs with required times (RTs); the number of layers; the specific resistance and capacitance and the maximum number of channels $Q_{ij}$ (the capacity of the corresponding global edge) in each layer; and the resistances and capacitances of vias.

The best trees generated by IMAD are stored as the candidate tree set for that net. To increase the cardinality of the candidate set, we use FLUTE [7] along with IMAD. FLUTE is a fast and accurate method for constructing a rectilinear Steiner minimal tree (RSMT). We create Steiner trees for each net by applying L-routing to the two-pin nets decomposed from multi-pin nets (the decomposition is done by FLUTE). We check the delay of the FLUTE-generated trees. If the tree generated by FLUTE is new (i.e., it was not already generated by IMAD) and has a delay comparable with that of the IMAD-generated trees, then it is added to the candidate tree set of that net. We take a FLUTE delay to be comparable if it is within 95% to 105% of the delay of the trees generated by IMAD. Therefore, we have several feasible Steiner trees of various types in the candidate set for each net. A gradient-based method, described below, is used to pick one tree for each net from its candidate set $Q_s$, so that the total overflow of the edges is minimum. Consider the problem of optimal use of routing resources, i.e., the available routing tracks. We use index $e \in E$ for an edge of the global graph $G$, and index $t$ for trees, where

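\[ t \in J = \bigcup_{s=1}^{S} Q_s. \]

Set

\[ a_{et} = \begin{cases} 1, & \text{if edge } e \in \text{the tree } t, \\ 0, & \text{otherwise,} \end{cases} \qquad \text{and} \qquad x_t = \begin{cases} 1, & \text{if tree } t \text{ is selected,} \\ 0, & \text{otherwise.} \end{cases} \]

Then the problem is defined as follows:

\[ f(x) = \sum_{e \in E} \left( \max\left\{ 0, \sum_{t \in J} a_{et} x_t - q_e \right\} \right)^2 \to \min_{x_t \in [0,1]}, \tag{4} \]

\[ \sum_{t \in Q_s} x_t = 1, \quad s = 1, \dots, S. \tag{5} \]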

Objective (4) is the sum of penalties for capacity overflows of routing resources $q_e$, $e \in E$. If there is no overflow, then the penalty (4) is zero. The linear relaxation of (4) and (5) is considered. The function $f(x)$ is convex and smooth, therefore the gradient algorithm described below can be applied.

For simplicity, we rearrange the objective function:

\[ \tilde f(y) = \sum_{e \in E} \left( \max\{0, y_e - q_e\} \right)^2, \]

where $y_e = \sum_{t \in J} a_{et} x_t$, $e \in E$.

We use the following notation below: let $\varepsilon > 0$ be the desired precision of the solution and $L$ be a lower bound of the objective function of problem (4) and (5).

Algorithm GradAl.

Step 0. Initialization.

• $\tilde x_t := 1/|Q_s|$ ($t \in Q_s$, $s = 1, \dots, S$);
• $\tilde y_e := \sum_{t \in J} a_{et} \tilde x_t$ ($e \in E$);
• $L := 0$.

Step 1. Iteration. Calculate the values

• $g_e = \max\{0, \tilde y_e - q_e\}$ ($e \in E$);
• $t_s = \arg\min_{t \in Q_s} \sum_{e \in E} g_e a_{et}$

and set

• $d_t = 1 - \tilde x_t$ if $t = t_s$, and $d_t = -\tilde x_t$ if $t \ne t_s$, for each $t \in Q_s$, $s = 1, \dots, S$;
• $z = (z_e)$, $z_e = \sum_{t \in J} a_{et} d_t$, $e \in E$;
• $GZ = \sum_{e \in E} g_e z_e$.

Recalculate the lower bound: $L := \max\{L, \tilde f(\tilde y) + 2GZ\}$. If $\tilde f(\tilde y) - L \le \varepsilon \max\{1, L\}$, then STOP. Else GOTO Step 2.

Step 2. Move to a new point.

Calculate $\tilde p = \min\{1, -GZ/ZZ\}$, where $GZ$ is defined above and $ZZ = \sum_{e \in E} (z_e)^2$, and set

• $\tilde x_t := \tilde x_t + d_t \tilde p$ ($t \in J$);
• $\tilde y_e := \tilde y_e + z_e \tilde p$ ($e \in E$).

Then GOTO Step 1.
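Steps 0 to 2 of GradAl translate almost line by line into code. The following Python sketch is an illustration under stated assumptions rather than the authors' C implementation: the function names, the representation of each candidate tree as a set of edge identifiers, and the `max_iter` safeguard are all hypothetical. It returns the fractional solution of the relaxed problem; how a single tree per net is finally extracted is not specified in the text and is not modelled here.

```python
def overflow_cost(y, capacity):
    """Objective (4), written in terms of the edge loads y_e."""
    return sum(max(0.0, y[e] - capacity[e]) ** 2 for e in capacity)


def grad_al(candidates, capacity, eps=1e-3, max_iter=10_000):
    """candidates[s] is the list Q_s of trees (sets of edge ids) of net s;
    capacity[e] is q_e. Every edge used by a tree must appear in capacity."""
    # Step 0: uniform fractional selection and the induced edge loads.
    x = [[1.0 / len(Q)] * len(Q) for Q in candidates]
    y = {e: 0.0 for e in capacity}
    for s, Q in enumerate(candidates):
        for k, tree in enumerate(Q):
            for e in tree:
                y[e] += x[s][k]
    lower = 0.0                                  # lower bound L
    for _ in range(max_iter):                    # max_iter is a safety net only
        # Step 1: overflows g_e, cheapest tree t_s per net, directions d and z.
        g = {e: max(0.0, y[e] - capacity[e]) for e in capacity}
        d, z = [], {e: 0.0 for e in capacity}
        for s, Q in enumerate(candidates):
            cost = [sum(g[e] for e in tree) for tree in Q]
            best = cost.index(min(cost))
            d.append([(1.0 if k == best else 0.0) - x[s][k]
                      for k in range(len(Q))])
            for k, tree in enumerate(Q):
                for e in tree:
                    z[e] += d[s][k]
        gz = sum(g[e] * z[e] for e in capacity)
        f = overflow_cost(y, capacity)
        lower = max(lower, f + 2.0 * gz)
        if f - lower <= eps * max(1.0, lower):
            break
        # Step 2: move along (d, z) with the step size p = min(1, -GZ/ZZ).
        zz = sum(v * v for v in z.values())
        p = min(1.0, -gz / zz)
        for s in range(len(candidates)):
            for k in range(len(candidates[s])):
                x[s][k] += d[s][k] * p
        for e in capacity:
            y[e] += z[e] * p
    return x, overflow_cost(y, capacity)
```

If $ZZ = 0$ then $z = 0$ and hence $GZ = 0$, so the stopping test fires before the division is reached. A toy call with three nets competing for one edge `"a"` of capacity 1 would be `grad_al([[{"a"}, {"b"}], [{"a"}, {"c"}], [{"a"}]], {"a": 1.0, "b": 1.0, "c": 1.0})`, which shifts the first two nets toward their alternative trees.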

Claim 1. The algorithm provides a $(1+\varepsilon)$-approximation solution of the problem (4) and (5). Its time complexity is $O(|E||J|\varepsilon^{-1} \ln \varepsilon^{-1})$.

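Proof. First we demonstrate that the objective function decreases at each iteration. Since the function $\tilde f(y)$ is convex,

\[ \tilde f(y) \ge \tilde f(\tilde y) + \nabla \tilde f(\tilde y)(y - \tilde y) = \tilde f(\tilde y) + \sum_{e \in E} 2 g_e (y_e - \tilde y_e), \]

where $y$ is an arbitrary feasible point. We estimate the last expression by solving the problem

\[ \sum_{e \in E} g_e y_e = \sum_{e \in E} g_e \sum_{t \in J} a_{et} x_t = \sum_{t \in J} \left( \sum_{e \in E} g_e a_{et} \right) x_t \to \min_x. \]

The problem above decomposes into independent subproblems for each net $s$:

\[ \min_x \sum_{t \in J} \left( \sum_{e \in E} g_e a_{et} \right) x_t = \sum_{s=1}^{S} \min_{t \in Q_s} \left( \sum_{e \in E} g_e a_{et} \right) = \sum_{s=1}^{S} \sum_{e \in E} g_e a_{e t_s}; \]

the last equality follows from the definition of $t_s$. Thus,

\[ \sum_{e \in E} g_e (y_e - \tilde y_e) \ge \sum_{s=1}^{S} \sum_{e \in E} g_e a_{e t_s} - \sum_{e \in E} g_e \tilde y_e = \sum_{e \in E} g_e \left( \sum_{s=1}^{S} a_{e t_s} - \sum_{t \in J} a_{et} \tilde x_t \right) \]

\[ = \sum_{e \in E} g_e \left( \sum_{s=1}^{S} a_{e t_s} (1 - \tilde x_{t_s}) - \sum_{t \in J,\ t \ne t_s} a_{et} \tilde x_t \right) = \sum_{e \in E} g_e \sum_{t \in J} a_{et} d_t = GZ. \]

Therefore,

\[ \tilde f(y) \ge \tilde f(\tilde y) + 2GZ, \]

so $L = \tilde f(\tilde y) + 2GZ$ is a lower bound for $\tilde f(y)$, and if $\tilde f(\tilde y) - L \le \varepsilon \max\{1, L\}$, then $\tilde y$ defines a $(1+\varepsilon)$-approximation. Otherwise the descent direction is defined by the vectors $d$ and $z$. □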

The descent direction $d$ is from the current point $\tilde x$ to the integer point $x = (x_t)$, where $x_t = 1$ if $t \in \{t_1, \dots, t_S\}$, and $x_t = 0$ otherwise. In order to find a stride parameter $p \in [0,1]$, we can find the minimum of the function

\[ h(p) = f(\tilde y + zp) = \sum_{e \in E} \left( \max\{0, \tilde y_e - q_e + z_e p\} \right)^2, \]

but we will use the simpler function

\[ r(p) = \sum_{e \in E} (g_e + z_e p)^2. \]

It is easy to check that $h(0) = r(0)$, $h'(0) = r'(0)$, $h(p) \le r(p)$ for all $p \in [0,1]$, and the function $r$ reaches its minimum at $\tilde p = \min\{1, -GZ/ZZ\}$.

If the current point $\tilde y$ is not optimal, then $h'(0) < 0$, $r'(0) < 0$ and $\tilde p > 0$. The inequality $h(0) > h(\tilde p)$ follows from the expression $h(0) = r(0) > r(\tilde p) \ge h(\tilde p)$.

Therefore, moving to the point $\tilde x'_t = \tilde x_t + d_t \tilde p$ ($t \in J$) and $\tilde y'_e = \tilde y_e + z_e \tilde p$ ($e \in E$) does not increase the objective function.

Now we prove that each iteration has time complexity $O(|E||J|)$, and the total number of iterations is bounded by $O(\varepsilon^{-1} \ln \varepsilon^{-1})$.

Let us first find a lower bound for $q = r(0) - r(\tilde p)$ for one iteration. Set $D = -2GZ$. From the definition of $r(p)$ and the step (stride parameter) $\tilde p$ it follows that if $D \ge 2ZZ$, then $q \ge D/2$, and if $D \le 2ZZ$, then $q \ge \frac{D^2}{4ZZ}$. Finally,


\[ q \ge \min\left\{ \frac{D}{2}, \frac{D^2}{4ZZ} \right\} = \frac{D}{2} \min\left\{ 1, \frac{D}{2ZZ} \right\}. \]
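The step size $\tilde p$ and the two cases of the bound on $q$ can be checked by expanding $r(p)$ as a quadratic in $p$. The short derivation below is a reader's sketch that uses only the definitions of $GZ$, $ZZ$ and $D = -2GZ$ given above:

\[ r(p) = \sum_{e \in E} (g_e + z_e p)^2 = r(0) + 2p\,GZ + p^2\,ZZ, \qquad r'(p) = 2GZ + 2p\,ZZ = 0 \iff p = -\frac{GZ}{ZZ} = \frac{D}{2ZZ}. \]

Restricting $p$ to $[0,1]$ gives $\tilde p = \min\{1, D/(2ZZ)\}$, and

\[ q = r(0) - r(\tilde p) = \tilde p D - \tilde p^2 ZZ = \begin{cases} D - ZZ \ge \dfrac{D}{2}, & \text{if } D \ge 2ZZ \ (\tilde p = 1), \\[2mm] \dfrac{D^2}{2ZZ} - \dfrac{D^2}{4ZZ} = \dfrac{D^2}{4ZZ}, & \text{if } D \le 2ZZ \ (\tilde p = D/(2ZZ)). \end{cases} \]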

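The same estimate holds for the decrease $\delta = f(\tilde y) - f(\tilde y + z\tilde p)$ of the objective function, since $\delta \ge q$.

Let us now estimate an upper bound for the gap $\Delta = f(\tilde y) - L$. Define $Q$ as the total number of iterations. We mark the parameters of an iteration by the upper index $l = 0, 1, \dots, Q$. Step 0 corresponds to iteration $l = 0$; then $\tilde y^0$ is the initial point, $L^0 = 0$ and $\Delta^0 = f(\tilde y^0) - L^0 = f(\tilde y^0)$. Define $c$ as a number which is greater than each $f(\tilde y^l)$, $D^l$ and $ZZ^l$, $l = 0, 1, \dots$, even if the process is infinite. Such a number exists since the feasible set of the linear relaxation problem is compact.

From the equalities $f(\tilde y^l) = f(\tilde y^{l-1}) - \delta^l$, $L^l = \max\{L^{l-1}, f(\tilde y^{l-1}) - D^l\}$ (with $D^l = -2GZ^l$ as above) and $\Delta^l = f(\tilde y^l) - L^l$ it follows that $\Delta^l = \min\{\Delta^{l-1}, D^l\} - \delta^l$. Taking into account the above lower bound for $\delta^l$, the definition of the number $c$ and the inequality $D^l \ge 0$, we get

\[ \Delta^l \le \min\{\Delta^{l-1}, D^l\} - \frac{(D^l)^2}{4c} \le \max_{v \ge 0} \left\{ \min\{\Delta^{l-1}, v\} - \frac{v^2}{4c} \right\}. \]

The maximum of the last expression is reached when $v = \Delta^{l-1}$; then

\[ \Delta^l \le \Delta^{l-1} - \frac{(\Delta^{l-1})^2}{4c} = \left( 1 - \frac{\Delta^{l-1}}{4c} \right) \Delta^{l-1}. \]

We continue this inequality using the inequality $\Delta^{l-1} > \varepsilon$, which is true at each iteration except the last one. As a result, we get

\[ \Delta^l \le \left( 1 - \frac{\varepsilon}{4c} \right) \Delta^{l-1} \le \left( 1 - \frac{\varepsilon}{4c} \right)^l \Delta^0 \le \left( 1 - \frac{\varepsilon}{4c} \right)^l c. \]

The number of iterations $Q$ is bounded by $T(\varepsilon)$, where $\left( 1 - \frac{\varepsilon}{4c} \right)^{T(\varepsilon)} c = \varepsilon$, and since $\lim_{\varepsilon \to 0} \frac{T(\varepsilon)}{\varepsilon^{-1} \ln \varepsilon^{-1}} = 4c$, the complexity does not exceed $O(\varepsilon^{-1} \ln \varepsilon^{-1})$.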
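The limit used in the last step can be made explicit. The following derivation is a sketch that assumes only the defining equation of $T(\varepsilon)$ stated above and $0 < \varepsilon < c$:

\[ \left( 1 - \frac{\varepsilon}{4c} \right)^{T(\varepsilon)} c = \varepsilon \iff T(\varepsilon) = \frac{\ln(c/\varepsilon)}{-\ln\left( 1 - \frac{\varepsilon}{4c} \right)}. \]

Since $-\ln(1 - x) \ge x$ for $0 \le x < 1$,

\[ T(\varepsilon) \le \frac{4c}{\varepsilon} \ln\frac{c}{\varepsilon} = \frac{4c}{\varepsilon} \left( \ln \varepsilon^{-1} + \ln c \right), \]

and because $-\ln(1 - x) = x + O(x^2)$ as $x \to 0$, the ratio $T(\varepsilon) / (\varepsilon^{-1} \ln \varepsilon^{-1})$ tends to $4c$ as $\varepsilon \to 0$.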

5. History based MAD

To reduce the overflow further, a penalty based on congestion history is added to the original cost function. Optimizing congestion and delay at the same time is a difficult task, as minimizing delay can lead to higher congestion and vice versa. Therefore, a trade-off between the delay and the overflow is made in the cost function, so that the delay is not increased beyond a certain range to mitigate the congestion. A penalty is added to the history based cost of all the overflowed edges.

The history based cost of an edge $e$ at the $(i+1)$th iteration is given as

\[ h^{i+1}_e = \begin{cases} h^i_e + c, & \text{if edge } e \text{ has overflow,} \\ h^i_e, & \text{otherwise,} \end{cases} \]

where $h^i_e$ is the history based cost of edge $e$ at the $i$th iteration and $c$ is a constant which can be increased to give more weightage to congestion in the cost function. The actual cost of the edge $e$ is now calculated as $c_e = (b_e + h_e) \cdot p_e$, where $b_e$ represents the base cost of Step 1 of algorithm MAD of Section 3, which is dedicated to optimizing delay, and $p_e$ represents the current congestion penalty. This approach is called Negotiated Congestion Routing (NCR) [19]. $p_e$ can be obtained using the following equations. Let the density of an edge be

\[ d(e) = \frac{\text{demand (current number of wires passing through the edge)}}{\text{supply (capacity)}}. \]

Then the congestion penalty term $p_e$ [24] for edge $e$ is defined as

\[ p_e = \begin{cases} \exp(\beta (d_e - 1)), & \text{if } d_e > 1, \\ d_e, & \text{otherwise,} \end{cases} \]

where the value of $\beta$ used in the experiments is $\ln 5$.
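The history update and the penalty term combine into a per-edge cost that is simple to compute. The Python sketch below only illustrates the formulas of this section; the function names are hypothetical, the default `increment=1.0` for the history constant is an assumption (the text gives no value for it), and `beta` defaults to the $\ln 5$ reported for the experiments.

```python
import math


def congestion_penalty(demand, capacity, beta=math.log(5)):
    """Penalty p_e for an edge with density d_e = demand / capacity."""
    density = demand / capacity
    return math.exp(beta * (density - 1.0)) if density > 1.0 else density


def update_history(history, has_overflow, increment=1.0):
    """History cost for the next iteration; it grows only on overflowed edges.
    The increment value is an assumption, not taken from the paper."""
    return history + increment if has_overflow else history


def edge_cost(base, history, demand, capacity):
    """Combined edge cost c_e = (b_e + h_e) * p_e."""
    return (base + history) * congestion_penalty(demand, capacity)
```

The two branches of the penalty agree at $d_e = 1$, where both give 1, so the cost is continuous in the density; with $\ln 5$ as the exponent constant, an edge carrying twice its capacity is penalized by a factor of about 5.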

Table 1. Results of MAD with and without Gradient on examples derived from IBM/ISPD98 benchmarks.

                                 MAD w/o Gradient                                History based MAD with Gradient
Bench name  Grids     # of nets  Max ovfl  Total ovfl  Delay (ps)  Run time (s)  Max ovfl  Total ovfl  Delay (ps)  Run time (s)
ibm01dd     64 × 64   11507      16        2494        23209.31    12.4          6         288         23539.25    21.2
ibm02dd     80 × 64   18429      17        2692        31201.34    18.9          4         59          31764.75    34.5
ibm03dd     80 × 64   21621      12        660         37339.23    17.3          2         3           37643.41    36.5
ibm04dd     96 × 64   26163      12        1574        54112.32    43.9          5         138         54289.78    83.5
ibm06dd     128 × 64  33354      16        2975        62911.21    47.4          5         90          63412.81    104.3
ibm07dd     192 × 64  44394      9         270         84201.69    134.2         0         0           84976.51    228.1
ibm08dd     192 × 64  47944      22        3531        91381.24    113.7         4         43          91729.24    238.7
ibm09dd     256 × 64  50393      10        1920        92817.23    189.1         4         77          93102.13    359.3
ibm10dd     256 × 64  64227      16        2483        126672.23   189.4         5         32          127134.33   435.7

Table 2. % of FLUTE generated trees selected by Gradient.

Bench name  % of FLUTE trees selected
ibm01dd     34.1
ibm02dd     37.8
ibm03dd     33.8
ibm04dd     38.3
ibm06dd     34.9
ibm07dd     27.2
ibm08dd     34.1
ibm09dd     31.7
ibm10dd     30.1
Average     33.6


6. Diversity of the generated Steiner trees

The performance of the Gradient algorithm depends on the diversity of the candidate Steiner tree set. The dissimilarity between the trees gives a number of routing possibilities for each net. Using a combination of trees, the congestion can be effectively reduced. To measure the diversity of the Steiner tree pool, we use the metric described in [11]. The diversity of the pool is

\[ D = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} D_{ij}}{N(N-1)/2} = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (1 - S_{ij})}{N(N-1)/2} = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( 1 - \frac{\mathrm{comm}(E_i, E_j)}{\max(E_i, E_j)} \right)}{N(N-1)/2}, \]

where
$N$ is the number of candidate Steiner trees in the pool,
$S_{ij}$ is the similarity between two trees $i$ and $j$,
$D_{ij}$ is $(1 - S_{ij})$, which is the dissimilarity between two trees $i$ and $j$,
$\mathrm{comm}(E_i, E_j)$ is the percentage of common edges shared by trees $i$ and $j$, and
$\max(E_i, E_j)$ represents the maximum number of edges in either tree.
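The diversity metric is a plain average of pairwise dissimilarities and can be computed directly from the edge sets. The Python sketch below is an illustration only: the function name `pool_diversity` is hypothetical, each tree is assumed to be given as a non-empty set of edge identifiers, and the similarity $S_{ij}$ is interpreted as the number of common edges divided by the larger of the two edge counts, which is one reading of the definition above.

```python
from itertools import combinations


def pool_diversity(trees):
    """Average pairwise dissimilarity D of a pool of trees.

    trees: list of non-empty edge sets, one per candidate Steiner tree.
    S_ij is taken as |E_i & E_j| / max(|E_i|, |E_j|) (an interpretation).
    """
    pairs = list(combinations(trees, 2))         # all i < j, N(N-1)/2 pairs
    if not pairs:
        return 0.0                               # a pool of one tree has no diversity
    total = sum(1.0 - len(a & b) / max(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)


# Three candidate trees; the first two share two of their three edges.
pool = [{1, 2, 3}, {1, 2, 4}, {5, 6, 7}]
diversity = pool_diversity(pool)
```

Under this reading, identical trees contribute 0 and edge-disjoint trees contribute 1, so $D$ always lies in $[0, 1]$ and can be reported as a percentage.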

7. Experimental results

Algorithm IMAD, history based MAD and FLUTE generate a set of prospective trees for each net. For the experiment, the maximum number of candidate trees used in the pool is 7. The average diversity of the trees is 32%. A tree is chosen for each net using the gradient algorithm such that the minimal residual capacity of global edges is maximal.

All algorithms are implemented in C on a quad-core AMD Opteron machine running Linux. We note that no benchmark suites are available for delay-driven routing. Existing methods [30,29,14] consider examples too small (maximum number of nets 1294) to be meaningful for modern VLSI physical design practices. Also, although this is a multi-objective NP-hard problem, our delay-driven routing reduces overflow within practical running times, similarly to the existing algorithms, and unlike congestion-aware-only routing algorithms it also accounts for delay. We run MAD without Gradient and IMAD with Gradient on a set of examples derived by modifying the IBM/ISPD98 benchmark suite, assigning resistance (0.016 specific resistance) and capacitance (0.47 specific capacitance) values to the wires. The modified examples are renamed by adding the suffix "dd" to the benchmark design names.

The results are shown in Table 1. From Table 1, it can be seen how efficiently Gradient reduces the overflow. In Table 2, we show the percentage of FLUTE-generated trees selected by the gradient algorithm. On average, 33.6% of the selected trees are generated by FLUTE, i.e., 66.4% of the selected trees are from MAD, which establishes the importance of MAD as a timing-driven router.
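The summary figures quoted from the tables can be rechecked with a few lines of arithmetic. The Python snippet below is a reader's check, not part of the reported experiments; the numbers are copied from Tables 1 and 2, and the variable names are hypothetical.

```python
# Percentage of FLUTE-generated trees selected by Gradient (Table 2).
flute_share = {
    "ibm01dd": 34.1, "ibm02dd": 37.8, "ibm03dd": 33.8,
    "ibm04dd": 38.3, "ibm06dd": 34.9, "ibm07dd": 27.2,
    "ibm08dd": 34.1, "ibm09dd": 31.7, "ibm10dd": 30.1,
}
avg_flute = sum(flute_share.values()) / len(flute_share)
avg_mad = 100.0 - avg_flute                      # remaining share comes from MAD

# Total overflow per benchmark (Table 1), in the same benchmark order.
ovfl_without_gradient = [2494, 2692, 660, 1574, 2975, 270, 3531, 1920, 2483]
ovfl_with_gradient = [288, 59, 3, 138, 90, 0, 43, 77, 32]
reduction = 1.0 - sum(ovfl_with_gradient) / sum(ovfl_without_gradient)

print(f"FLUTE share: {avg_flute:.1f}%, MAD share: {avg_mad:.1f}%")
print(f"Aggregate total-overflow reduction: {100 * reduction:.1f}%")
```

The aggregate reduction is computed over the sum of the nine total-overflow values, so large benchmarks weigh more than small ones; a per-benchmark average would be a different statistic.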

8. Conclusion

In this paper, we proposed a provably tight timing-driven Steiner tree construction algorithm. We also proposed a gradient-based method for minimizing the overflow. We have implemented our algorithm on modified benchmarks derived from the large industry-standard IBM/ISPD98 benchmarks. None of the available timing-driven global routing algorithms works on such a large number of nets. The current algorithm is limited in its ability to obtain zero-overflow solutions. In future research, we will work on further reduction of the overflow while maintaining the timing at the sinks.

Appendix A. Theoretical Analysis of MAD

In this section, we investigate the performance of MAD applied to grid graphs. We call $G$ an $M \times N$ grid graph if the set of vertices is $V(G) = \{(m,n) \mid 0 \le m \le M,\ 0 \le n \le N\}$ for some $M$ and $N$ and $E(G) = \{(v_1, v_2) \mid v_1 = (m,n),\ v_2 = (m,n+1) \text{ or } v_2 = (m+1,n)\}$. A weighted grid graph $G$ belongs to class $C_1$ if the resistances and capacitances of all the edges of one line (horizontal or vertical) are equal, i.e., given edges $e_1 = ((m_1,n_1),(m_1,n_1+1))$ and $e_2 = ((m_1,n_2),(m_1,n_2+1))$, we have $r_{e_1} = r_{e_2}$ and $c_{e_1} = c_{e_2}$. The same property holds for horizontal edges.

A weighted grid graph $G$ belongs to class $C_2$ if $G \in C_1$ and additionally, for each $e \in E(G)$, $c_e = k \cdot r_e$ holds for some $k$. In this section, we will show that (a) if $G \in C_1$ is an $m \times n$ grid graph with a single terminal vertex $u$, then MAD gives the minimum delay path connecting the source vertex 0 and terminal $u$, and (b) if $G \in C_2$ contains $n$ terminal vertices, then MAD provides an $n$-approximation solution of problem (3).

A.1. Grid graphs with a single terminal

Here we assume that graph $G$ belongs to class $C_1$ and contains only one terminal vertex $u$: $S = \{0, u\}$. We examine the performance of MAD on such graphs and demonstrate the optimality of the path obtained.

First we show that, given two edges of a line (horizontal or vertical), the one nearest to the root 0 is added to the tree first.

Claim 2. Let $G \in C_1$ be a graph with a single terminal vertex $u$. Let $T$ be a partially constructed tree with vertices $u_1(x_1, y), u_2(x_2, y) \in T$ and $v_1(x_1, y+1), v_2(x_2, y+1) \notin T$ ($x_1 < x_2$). Then

\[ t_{v_1}(T \cup \{(u_1, v_1)\}) + d(u_1, v_1) < t_{v_2}(T \cup \{(u_2, v_2)\}) + d(u_2, v_2). \]

A similar inequality holds for horizontal edges.

Proof. We omit this technical proof for reasons of space. □

Claim 3. Let $G \in C_1$ be an $m \times n$ grid graph with a single terminal vertex $u$ and $T$ be a partially constructed tree. If vertex $v(x, y) \in T$, then all the vertices $w(i, j) \in T$ ($i \le x$, $j \le y$).

Proof. The proof is omitted due to page limitation. Details can be found in [28]. □

Theorem 1. Let $G \in C_1$ be an $m \times n$ grid graph with a single terminal vertex $u$ and $P$ be the path constructed by MAD. Then

\[ t_u(P) = \min_{P_{0u}} t_u(P_{0u}). \]


A.2. Grid graphs with n ≥ 2 terminals

In this section, we consider the operation of MAD on grid graphs with $n$ terminal vertices and estimate its approximation ratio in a special case.

Claim 4. Let graph $G$ contain $n$ terminal vertices. Assume that MAD connects the terminals in the order $\{u_1, u_2, \dots, u_k, \dots, u_n\}$ and denote the partially constructed tree spanning the set $\{u_1, \dots, u_k\}$ by $T_k$. Then

\[ t^*(T_k) \le \hat t_{u_k} + t^*(T_{k-1}). \]

Remark 2. Let $S$ be the set of terminal vertices and $T$ be a Steiner tree for $S$. Claim 4 yields the following estimate for the critical delay in $T$:

\[ t^*(T) \le \sum_{u \in S} \hat t_u. \]

Theorem 2. Let $G \in C_2$ and $S = \{0, 1, \dots, n\}$. Let $T_{MAD}$ be the tree constructed by MAD and $T_{opt}$ be an optimal tree spanning $S$. Then

\[ \frac{t^*(T_{MAD})}{t^*(T_{opt})} \le n. \]

Proof. Use Remark 2. Let ui be the i-th terminal attached to TMAD and Toptðu1;u2; . . . ;uiÞ be an optimal tree spanning setfu1;u2; . . . ;uig. As G belongs to class C2 delays along all the paths connecting 0 and ui are obviously equal. Also it is obviousthat t̂ui

6 t�ðToptðu1;u2; . . . ;uiÞÞ. Hence,

Fig. A.8. Graph G.

Fig. A.9. Tree T1.

Fig. A.10. Tree T2.


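\[ t^*(T_{MAD}) \leq \sum_{i=1}^{n} t^*(T_{opt}(u_1, u_2, \ldots, u_i)) \leq \sum_{i=1}^{n} t^*(T_{opt}(u_1, u_2, \ldots, u_n)) = n \cdot t^*(T_{opt}(u_1, u_2, \ldots, u_n)). \]

Therefore,

\[ \frac{t^*(T_{MAD})}{t^*(T_{opt})} \leq n. \]

□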

A.3. Tightness of the bound

In this section we demonstrate that the accuracy estimate of MAD obtained in the previous section is tight. Consider the grid graph $G \in C_2$ with $n = 2k$ terminals depicted in Fig. A.8 (a similar example for a graph with an odd number of terminals can be easily constructed).

Let $r_e = c_e$ for each $e \in E(G)$; let the resistances of all the vertical and "short" horizontal edges equal $\varepsilon$, and the resistances of all the "long" horizontal edges equal $L$. The capacities of all the terminals equal zero. We specify the values $r_0$, $\varepsilon$ and $L$ later.

Consider two Steiner trees: $T_1$ (Fig. A.9) and $T_2$ (Fig. A.10). It is easy to prove that the critical delays in trees $T_1$ and $T_2$ satisfy the following relations:

\[ t^*(T_1) = r_0 L + r_0 (k^2 + 3k - 2)\varepsilon + \varepsilon^2 \left( \tfrac{4}{3} k^3 + 2k^2 - 3k + \tfrac{1}{2} \right) + \varepsilon L (k^2 + 3k - 1) + \frac{L^2}{2}, \tag{A.1} \]

\[ t^*(T_2) = 2k r_0 L + r_0 (4k^2 - k - 1)\varepsilon + \varepsilon L (4k - 3) + \varepsilon^2 \left( 5k^2 - 6k + \tfrac{3}{2} \right) + \frac{L^2}{2}. \tag{A.2} \]

Choose $\delta > 0$ such that $\delta \ll L - 1$ and set $\varepsilon = \dfrac{\delta}{k^2 L}$, $r_0 = L^2$. In this case (A.1) and (A.2) imply:

\[ \frac{t^*(T_2)}{t^*(T_1)} = 2k - O\left(\frac{1}{L}\right) < 2k, \tag{A.3} \]

and the ratio converges to $2k$ as $L \to \infty$. One may easily see that tree $T_2$ can be constructed by MAD. Also, $t^*(T_{opt}) \leq t^*(T_1)$, since $T_1$ is a feasible Steiner tree. So we have


\[ \frac{t^*(T_{MAD})}{t^*(T_{opt})} \geq 2k - O\left(\frac{1}{L}\right), \tag{A.4} \]

and the following Remark is true.

Remark 3. Let $G \in C_2$ contain $n$ terminal vertices. Then MAD provides an $n$-approximation solution of problem (3).
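The limiting behaviour in (A.3) can be checked numerically. The sketch below is not part of the original derivation: it assumes that (A.1) and (A.2) read as the closed forms coded in `critical_delays` (a hypothetical helper name), and that the parameters are chosen as $\varepsilon = \delta/(k^2 L)$ and $r_0 = L^2$. Exact rational arithmetic is used so that rounding does not affect the comparison.

```python
from fractions import Fraction

def critical_delays(k, L, delta):
    """Evaluate the assumed closed forms of (A.1) and (A.2); returns (t*(T1), t*(T2))."""
    eps = Fraction(delta) / (k * k * L)   # assumed choice: eps = delta / (k^2 L)
    r0 = Fraction(L) ** 2                 # assumed choice: r0 = L^2
    t1 = (r0 * L
          + r0 * (k * k + 3 * k - 2) * eps
          + eps ** 2 * (Fraction(4, 3) * k ** 3 + 2 * k * k - 3 * k + Fraction(1, 2))
          + eps * L * (k * k + 3 * k - 1)
          + Fraction(L * L, 2))
    t2 = (2 * k * r0 * L
          + r0 * (4 * k * k - k - 1) * eps
          + eps * L * (4 * k - 3)
          + eps ** 2 * (5 * k * k - 6 * k + Fraction(3, 2))
          + Fraction(L * L, 2))
    return t1, t2

k = 3  # n = 2k = 6 terminals
for L in (10, 100, 1000, 10000):
    t1, t2 = critical_delays(k, L, Fraction(1, 100))
    # The ratio stays below 2k and moves towards 2k as L grows.
    print(L, float(t2 / t1))
```

Under these assumptions the printed ratios increase with $L$ and remain below $2k = 6$, in line with (A.3).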


References

[1] C. Albrecht, Global routing by new approximation algorithms for multicommodity flow, IEEE Trans. on CAD 20 (2001) 622–632.
[2] E. Bozorgzadeh, R. Kastner, M. Sarrafzadeh, Creating and exploiting flexibility in Steiner trees, in: ACM/IEEE Design Automation Conference, ACM, 2001, pp. 195–198.
[3] R. Carden, C.K. Cheng, A global router using an efficient approximate multicommodity multiterminal flow algorithm, in: ACM/IEEE DAC, 1991, pp. 316–321.
[4] Y.A. Chen, Y.L. Lin, Y.C. Hsu, A new global router for ASIC design based on simulated evolution, in: Int. Symp. on VLSI Technology, Systems and Applications, 1989, pp. 261–265.
[5] M. Cho, D.Z. Pan, Boxrouter: a new global router based on box expansion and progressive ILP, in: Design Automation Conference, 2006, pp. 373–378.
[6] S. Chowdhury, G. Grewal, D. Banerji, Clustering Hanan points to reduce VLSI interconnect routing times, in: IEEE Canadian Conference on Electrical & Computer Engineering, 2006, pp. 1223–1227.
[7] C. Chu, Y.C. Wong, Fast and accurate rectilinear Steiner minimal tree algorithm for VLSI design, in: International Symposium on Physical Design, 2005, pp. 28–35.
[8] C. Sechen, A. Sangiovanni-Vincentelli, The timberwolf placement and routing package, IEEE J. Solid-State Circuits SC-20 (1985) 510–522.
[9] H. Esbensen, A macro-cell global router based on two genetic algorithms, in: EDAC, 1994, pp. 428–433.
[10] N. Garg, J. Konemann, Faster and simpler algorithms for multicommodity flow and other fractional packing problems, in: 39th Ann. IEEE Symp. on Foundations of Computer Science, 1998, pp. 253–259.
[11] G. Grewal, X. Yu, M. Xu, Generating diverse pools of Steiner trees for VLSI routing, in: Canadian Conference on Electrical and Computer Engineering, 2005, pp. 677–685.
[12] X. Hong, T. Xue, J. Huang, C. Cheng, E.S. Kuh, Tiger: an efficient timing-driven global router for gate array and standard cell layout design, in: IEEE TCAD of Integrated Circuits and Systems, 2006, pp. 1323–1331.
[13] H. Hou, J. Hu, S.S. Sapatnekar, Non-Hanan routing, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 18 (4) (1999) 436–444.
[14] J. Hu, S.S. Sapatnekar, A timing-constrained simultaneous global routing algorithm, IEEE TCAD Integr. Circuits Syst. 21 (2002) 1025–1036.
[15] J.R. Gao, P.C. Wu, T.C. Wang, A new global router for modern designs, in: Asia and South Pacific Design Automation Conference, 2008, pp. 226–231.
[16] M.R. Kramer, J.V. Leeuwen, The complexity of wire routing and finding minimum area layouts for arbitrary VLSI circuits, Adv. Comput. Res. VLSI theory 2 (1984) 129–146.
[17] Y. Li, H. Gu, Fault tolerant routing algorithm based on the artificial potential field model in network-on-chip, Appl. Math. Comput. 217 (7) (2010) 3226–3235.
[18] M.M. Ozdal, M.D.F. Wong, Archer: a history-driven global routing algorithm, in: IEEE/ACM International Conference on Computer-Aided Design, 2007, pp. 488–495.
[19] L. McMurchie, C. Ebeling, Pathfinder: a negotiation-based performance-driven router for FPGAs, in: FPGA, 1995, pp. 111–117.
[20] M.D. Moffitt, Maizerouter: engineering an effective global router, in: Asia and South Pacific Design Automation Conference, 2008, pp. 226–231.
[21] M.D. Moffitt, J.A. Roy, I.L. Markov, The coming of age of (academic) global routing, in: ISPD, 2008, pp. 148–155.
[22] W.C. Elmore, The transient response of damped linear networks with particular regard to wide-band amplifiers, J. Appl. Phys. 19 (1948) 55–63.
[23] J.A. Roy, I.L. Markov, High-performance routing at the nanometer scale, in: ICCAD, 2007, pp. 496–502.
[24] J.A. Roy, I.L. Markov, High-performance routing at the nanometer scale, in: 2007 International Conference on Computer-Aided Design (ICCAD 07), November 5–8, 2007, San Jose, CA, USA, 2007, pp. 496–502.
[25] J. Rubinstein, P. Penfield, M.A. Horowitz, Signal delay in RC tree networks, IEEE Trans. CAD Integr. Syst. 2 (1983) 201–211.
[26] R. Samanta, A.I. Erzin, S. Raha, Y.V. Shamardin, I.I. Takhonov, V.V. Zalyubovskiy, A provably tight delay-driven concurrently congestion mitigating global routing algorithm, in: International conference on Numerical Computations: Theory and Algorithms (NUMTA), Falerna, Italy, 2013, p. 119.
[27] F. Shahrokhi, D.W. Matula, The maximum concurrent flow problem, J. ACM 37 (1990) 318–334.
[28] I.I. Takhonov, On properties of a Dijkstra-based algorithm for constructing delay driven Steiner trees, Unpublished Manuscript, Novosibirsk State University, Russia.
[29] J.T. Yan, Dynamic tree reconstruction with application to timing-constrained congestion-driven global routing, IEEE Proc. Comput. Digital Tech. 153 (2) (2006) 117–129.
[30] J.-T. Yan, S.-H. Lin, Timing-constrained congestion-driven global routing, in: Asia and South Pacific Design Automation Conference, 2004, pp. 683–686.
[31] H. Youssef, S.M. Sait, Timing-driven global routing for standard-cell VLSI design, Comput. Syst. Sci. Eng. 14 (1999) 175–185.
