
Design Theory and Implementation for Low-Power Segmented Bus Systems

W.-B. JONE, University of Cincinnati; J. S. WANG, National Chung-Cheng University; HSUEH-I LU, Academia Sinica; I. P. HSU, Faraday Technology Corp.; and J.-Y. CHEN, Winbond Electronics Corp.

The concept of bus segmentation has been proposed to minimize power consumption by reducing the switched capacitance on each bus [Chen et al. 1999]. This paper details the design theory and implementation issues of segmented bus systems. Based on a graph model and the Gomory-Hu cut-equivalent tree algorithm, a bus can be partitioned into several bus segments separated by pass transistors. Highly communicating devices are placed on adjacent bus segments, so most data communication can be achieved by switching a small portion of the bus segments. Thus, a significant amount of power consumption can be saved. It can be proved that the proposed bus partitioning method achieves an optimal solution. The concept of tree clustering is also proposed to merge bus segments for further power reduction. The design flow, which includes bus tree construction at the register-transfer level and bus segmentation cell placement and routing at the physical level, is discussed for design implementation. The technology has been applied to a μ-controller design, and simulation results by PowerMill show significant improvement in power consumption.

Categories and Subject Descriptors: B.7.1 [Integrated Circuits]: Types and Design Styles - VLSI (very large scale integration); B.7.2 [Integrated Circuits]: Design Aids - Placement and routing; G.2.2 [Discrete Mathematics]: Graph Theory - Graph algorithms; J.2 [Computer-Aided Engineering]: Computer-Aided Design (CAD)

General Terms: Algorithms, Design, Theory

Additional Key Words and Phrases: ASIC design, low-power design, bus segmentation, bus graph model, OLA tree, bus segmentation cell, low-power design flow

Authors' addresses: W.-B. Jone, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Rhodes 836B, P.O. Box 210030, Cincinnati, OH 45221-0300; email: [email protected]; J. S. Wang, Department of Electrical Engineering, National Chung-Cheng University, 16 San-Hsing, Ming-Hsiung, Chia-Yi, 621 Taiwan, R.O.C.; email: [email protected]; H.-I. Lu, Institute of Information Science, Academia Sinica, 128 Academia Rd., Section 2, Taipei 115, Taiwan, R.O.C.; email: [email protected]; I. P. Hsu, Faraday Technology Corp., 10-2, Li-Hsin First Road, Science-Based Industrial Park, Hsinchu, Taiwan 300, R.O.C.; email: [email protected]; J.-Y. Chen, Winbond Electronics Corp., Computer Product Design Dept. III Section Manager, No. 4, Creation Rd. III, Science-Based Industrial Park, Hsinchu, Taiwan, R.O.C.; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2003 ACM 1084-4309/03/0100-0038 $5.00

ACM Transactions on Design Automation of Electronic Systems, Vol. 8, No. 1, January 2003, Pages 38–54.

1. INTRODUCTION

As VLSI technology moves to ever higher densities of integration, power consumption has become one of the most important limitations in the design of VLSI circuits. Thus, VLSI design for power optimization to satisfy the power budget is an important research issue [Chandrakasan et al. 1992; Chandrakasan and Brodersen 1995]. Generally, a significant portion of the power in a chip is consumed by its bus system. For example, the bus wires dissipate about 15% to 30% of the total power in the DEC Alpha 21064 and Intel 80386 [Liu and Svenoson 1994]. The work by Mehra et al. [1997] also demonstrates that at least 20% to 35% of total power is dissipated by the bus wires in the case of QMF filters. Researchers have been trying to optimize the bus power from several different design aspects. Previous work can basically be divided into four categories: (1) design of bus drivers/receivers to decrease the bus swing [Najagome et al. 1993; Cardarilli et al. 1996; Golshan and Haroun 1994; Bellaouar et al. 1995; Ikeda and K. Asada 1994; Hiraki et al. 1995; Caufape and Figueras 1996], (2) encoding techniques to reduce the bus switching activity [Stan and Burleson 1995a, 1995b], (3) bus structure redesign to take advantage of local communications [Mehra et al. 1997], and (4) implementation of low-power buses using technologies such as emitter-coupled logic (ECL) and bipolar [Sundstorm et al. 1990]. Here, we focus on complementary metal-oxide semiconductor (CMOS) circuits only.

The concept of bus segmentation was proposed in Chen et al. [1999] to change the bus topology for the purpose of reducing the power consumption. The basic idea is to partition a single (and large) bus into several tree-structured bus segments that are separated by pass transistors as shown in Figure 1. The segmented bus system is designed in such a way that heavily communicating devices are connected to adjacent bus segments. Thus, by turning off some pass transistors, only part of the bus segments are involved in the data communication when two devices exchange data. It is assumed that the bus operation is dynamic and separated into two phases: the precharge phase and the evaluation phase. In the precharge phase, the precharge p-channel metal-oxide semiconductor (PMOS) transistor is turned on and the entire bus is charged to high voltage. During the bus evaluation phase, the devices on the bus communicate with one another by enabling the control signal (of the sender) and the input latch (of the receiver). By this design, instead of discharging the entire bus, only a small number of the bus segments are involved in each device communication and power consumption can be greatly reduced. Since each data communication is performed through bus segments which are adjacent, the bus communication speed can be enhanced as well.


Fig. 1. Bus segmentation.

This paper elaborates the design theory and implementation issues of segmented bus systems. A bus graph model will be used to represent the bus communication system under consideration. Each edge of the bus graph is weighted by a communication frequency if the two devices connected by this edge exchange data. Given the weighted bus graph, we apply the efficient Gomory-Hu method [Gomory and Hu 1961] to partition the single bus into a bus tree where each tree segment is connected to a single device. It can be proved that the bus partitioning method gives an optimal solution in polynomial time. A tree-clustering method is then used to connect several devices into a bus segment if such segment sharing can further reduce the power consumption.

To efficiently implement the proposed bus segmentation technique, we incorporate the bus tree construction process into the register-transfer level (RTL) of our design flow. The resultant RTL design is then synthesized to obtain the gate-level circuits. One major enhancement in RTL design is to partition the bus into segments and add the required control signal generation circuits to the control unit. The other major enhancement is in the physical level design where the bus segmentation cells are designed and integrated into the whole chip placement and routing process. The bus segmentation technology has also been applied to an 8051-compatible 8-bit μ-controller. Simulation using PowerMill shows significant improvement in power consumption.

Section 2 gives background on the bus graph model and describes the bus graph partitioning problem. We prove that the bus segmentation problem can be solved optimally using the well-known Gomory-Hu algorithm. Section 3 discusses the design and implementation issues including design flow enhancement, design environment integration, and placement and routing guidelines. Experimental results from designing and simulating a μ-controller chip are reported in Section 4. Finally, concluding remarks are given in Section 5.

2. BUS GRAPH MODEL AND OPTIMAL LINEAR ARRANGEMENT TREE

The basic idea of bus segmentation is to divide the entire bus into several small bus segments, as shown in Figure 1, such that data exchange among devices will result in minimum power consumption. The power model of a circuit is generally expressed as $P = \sum_i C_i \cdot V_{dd}^2 \cdot f_i$, where $C_i$ is the load capacitance of circuit $i$, $V_{dd}$ the operation voltage, and $f_i$ the switching frequency. With the assumption of a fixed operation voltage, the term $V_{dd}$ can be dropped in the following discussion. Thus, the power model is simplified as $\sum_i (C_i \cdot f_i)$, where the switching frequency $f_i$ can be minimized at the algorithm, architecture, and logic design levels. Finally, the load capacitance $C_i$ can be greatly reduced at the circuit and physical design levels. Therefore, minimizing the switched capacitance $C_i \cdot f_i$ for each device $i$ at the circuit and physical design levels can be used as the last resort in low-power design. This work concentrates on bus segmentation which leads to switched capacitance reduction.

Fig. 2. (a) An edge-weighted graph. (b) A bus tree implementation. (c) The corresponding bus tree.

Based on several assumptions [Chen et al. 1999], the bus segmentation problem can be modeled as partitioning the bus graph into a bus tree such that the total switched capacitance, and thus the power consumption, by bus switching is minimized. As shown in Figure 2(a), each node in the bus graph represents a device, while each weighted edge represents the communication frequency between a pair of devices. A possible bus tree implementation for the bus graph of Figure 2(a) is shown in Figure 2(b). Figure 2(c) is the corresponding bus tree. The switched capacitance between devices $i$ and $j$ is the product of $\mathrm{weight}_G(i, j)$ (i.e., the relative communication frequency) and the capacitance $C(i, j)$ required to be charged or discharged when devices $i$ and $j$ perform signal exchange. The capacitance cost of a signal communication between devices $i$ and $j$ can be estimated by the formula $k_1(n - 1) + k_2$, where $k_1$ and $k_2$ are some capacitance values and $n$ is the number of bus segments charged or discharged for communicating devices $i$ and $j$. Details of the derivation for the capacitance cost formula can be found in Chen et al. [1999].
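The capacitance cost formula above can be illustrated with a small Python sketch. The function names and the numeric values chosen for $k_1$, $k_2$, and the edge weight are hypothetical and serve only to show how the number of switched segments enters the cost.

```python
def capacitance_cost(n_segments: int, k1: float, k2: float) -> float:
    """Capacitance cost k1 * (n - 1) + k2 of one transfer that charges or
    discharges n_segments bus segments."""
    if n_segments < 1:
        raise ValueError("a transfer uses at least one bus segment")
    return k1 * (n_segments - 1) + k2


def switched_capacitance(weight: float, n_segments: int, k1: float, k2: float) -> float:
    """Relative communication frequency times the capacitance cost."""
    return weight * capacitance_cost(n_segments, k1, k2)


# Hypothetical values in arbitrary units; they are not taken from the paper.
K1, K2 = 1.0, 0.5
one_segment = switched_capacitance(0.7, 1, K1, K2)     # 0.7 * 0.5
three_segments = switched_capacitance(0.7, 3, K1, K2)  # 0.7 * 2.5
```

With these assumed constants, the same pair of devices costs five times as much when its transfers span three segments instead of one, which is why the formulation that follows concentrates on the term $n - 1$.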

By the above discussion, in order to achieve low-power design for a bus organization, it suffices to minimize the number $(n - 1)$ of traversed bus segments for each pair of data communication. Some definitions are required. The distance of two nodes $i, j$, denoted $\mathrm{dist}(T, i, j)$, on a tree $T$ is the number of edges traversed from $i$ to $j$ on $T$. Let $G$ be a weighted graph. Let $T$ be a spanning tree of $G$. The linear arrangement cost of $T$ corresponding to $G$, denoted $\mathrm{cost}_{LA}(T, G)$, is

$$\mathrm{cost}_{LA}(T, G) = \sum_{e = (i, j) \in E} \mathrm{weight}_G(e) \cdot \mathrm{dist}(T, i, j).$$

The bus segmentation problem is as follows:

Given an edge-weighted graph $G = (V, E)$, the bus segmentation problem is to identify a spanning tree $T$ whose linear arrangement cost, $\mathrm{cost}_{LA}(T, G)$, is minimum.
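As an illustration of the definition, the following Python sketch evaluates the linear arrangement cost of two candidate spanning trees. The four-device graph, its weights, and the function names are hypothetical; the code only computes the sum in the definition and makes no attempt to find the optimum.

```python
from collections import deque


def tree_distances(tree_edges, source):
    """Number of tree edges from source to every node (breadth-first search)."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def cost_la(tree_edges, graph_weights):
    """Sum over graph edges (i, j) of weight(i, j) * dist(T, i, j)."""
    total = 0.0
    for (i, j), w in graph_weights.items():
        total += w * tree_distances(tree_edges, i)[j]
    return total


# Hypothetical bus graph: keys are device pairs, values are relative
# communication frequencies.
GRAPH = {(1, 2): 3.0, (2, 3): 1.0, (3, 4): 2.0, (1, 4): 0.5}

chain_tree = [(1, 2), (2, 3), (3, 4)]
other_tree = [(1, 2), (1, 4), (3, 4)]
cost_chain = cost_la(chain_tree, GRAPH)  # 3*1 + 1*1 + 2*1 + 0.5*3
cost_other = cost_la(other_tree, GRAPH)  # 3*1 + 1*3 + 2*1 + 0.5*1
```

In this toy case the chain keeps the three heaviest pairs at distance one and only stretches the lightest pair, so it has the lower cost of the two candidates.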

The derived spanning tree $T$ is called an optimal linear arrangement (OLA) tree. Given an edge-weighted graph $G$, the algorithm by Gomory and Hu [1961] will be applied to find a tree on $G$ with minimum linear arrangement cost in polynomial time.

Input: An undirected graph $G = (V, E)$, and a function $\mathrm{weight}_G : E \to R^+$
Output: A tree $G' = (V, E')$
begin
    Create a super node which is composed of all nodes in $V$.
    while any super node contains more than one node do
        Select a super node that contains at least two nodes.
        Arbitrarily take two nodes $u$, $v$ in the super node.
        Get a min $u$-$v$ cut, $\mathrm{cut}_G(A)$.
        Split the super node based on the min-cut into two super nodes,
        and have an edge with weight $\mathrm{weight}_G(\mathrm{cut}_G(A))$.
    return the resulting tree of super nodes.
end

Fig. 3. The algorithm for computing a Gomory-Hu cut tree.

Fig. 4. (a) The initial super node. (b) The tree of super nodes after obtaining the min 1-2 cut. (c) The tree of super nodes after obtaining the min 2-3 cut.

Some definitions are required for elaborating our tree-construction algorithm. Let $G = (V, E)$ be a connected edge-weighted graph. A subset $E'$ of $E$ is a cut of $G$ if $G - E'$ is not connected. A cut $E'$ of $G$ is a $u$-$v$ cut of $G$ if $u$ and $v$ are not connected in $G - E'$. For every subset $A$ of $V$, let $\mathrm{cut}_G(A)$ consist of the edges of $G$ with exactly one endpoint in $A$. Define the weight of $\mathrm{cut}_G(A)$ to be $\mathrm{weight}_G(\mathrm{cut}_G(A)) = \sum_{e \in \mathrm{cut}_G(A)} \mathrm{weight}_G(e)$. A $u$-$v$ cut is a min $u$-$v$ cut if its weight is minimum over all $u$-$v$ cuts of $G$. For example, in Figure 2(a), if $A$ contains nodes 1, 2, 6, and 7, then $\mathrm{cut}_G(A)$ is a cut containing edges (2, 3), (3, 6), and (5, 6). The weight of the cut, $\mathrm{weight}_G(\mathrm{cut}_G(A))$, equals 1.3. Furthermore, $\mathrm{cut}_G(A)$ is a 6-3 cut and is also a min 6-3 cut. However, $\mathrm{cut}_G(A)$ is a 1-4 cut but not a min 1-4 cut.
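The cut definitions can be made concrete with a short Python sketch. The graph below is a hypothetical four-node example, not the graph of Figure 2(a), and the helper names are illustrative only.

```python
def cut_edges(graph_weights, side_a):
    """Edges with exactly one endpoint in side_a, i.e. cut_G(A)."""
    side_a = set(side_a)
    return [(u, v) for (u, v) in graph_weights if (u in side_a) != (v in side_a)]


def cut_weight(graph_weights, side_a):
    """weight_G(cut_G(A)): the sum of the weights of the cut edges."""
    return sum(graph_weights[e] for e in cut_edges(graph_weights, side_a))


def is_uv_cut(side_a, u, v):
    """cut_G(A) separates u from v when exactly one of them lies in A."""
    side_a = set(side_a)
    return (u in side_a) != (v in side_a)


# Hypothetical edge-weighted graph (weights are relative frequencies).
GRAPH = {(1, 2): 1.0, (1, 3): 0.5, (2, 3): 0.4, (2, 4): 0.7, (3, 4): 0.3}

A = {1, 2}
edges_of_cut = cut_edges(GRAPH, A)    # (1, 3), (2, 3), (2, 4)
weight_of_cut = cut_weight(GRAPH, A)  # 0.5 + 0.4 + 0.7
```

Deciding whether such a cut is also a min $u$-$v$ cut requires comparing it against all other $u$-$v$ cuts, which is typically done with a max-flow computation rather than by enumeration.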

The Gomory-Hu (GH) cut-equivalent tree algorithm was proposed to solve the multiterminal network flow problem [Gomory and Hu 1961]. Gomory and Hu showed that a minimum cut between nodes $i$ and $j$ gives the maximal flow between them. The same concept can be applied to solve our OLA-tree problem in polynomial time. The basic idea of the Gomory-Hu cut-equivalent tree algorithm is to arbitrarily select two nodes, then find a minimum cut that disconnects these two nodes. The min-cut is then represented by an edge. This process is repeated until the graph becomes a tree. The algorithm is shown in Figure 3.
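A runnable sketch of the idea is given below in Python. It does not follow Figure 3 literally: instead of splitting super nodes it uses Gusfield's well-known variant, which performs $|V| - 1$ min-cut computations on the original graph and records a parent pointer per node. The min-cut routine is a plain Edmonds-Karp max-flow. The five-node graph, its integer weights, and all function names are hypothetical.

```python
from collections import deque


def min_cut(nodes, weights, s, t):
    """Edmonds-Karp max-flow on an undirected graph.
    Returns (cut value, set of nodes on the s side of a minimum s-t cut)."""
    cap = {u: {} for u in nodes}
    for (u, v), w in weights.items():
        cap[u][v] = cap[u].get(v, 0) + w
        cap[v][u] = cap[v].get(u, 0) + w
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the nodes reached form the s side.
            return flow, set(parent)
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


def gomory_hu_tree(nodes, weights):
    """Gusfield's variant: returns tree edges as (node, parent, min-cut value)."""
    nodes = list(nodes)
    parent = {v: nodes[0] for v in nodes}
    tree = []
    for idx in range(1, len(nodes)):
        s = nodes[idx]
        t = parent[s]
        value, s_side = min_cut(nodes, weights, s, t)
        tree.append((s, t, value))
        # Later nodes that fall on the s side and hung off t now hang off s.
        for v in nodes[idx + 1:]:
            if v in s_side and parent[v] == t:
                parent[v] = s
    return tree


# Hypothetical graph with integer weights (exact arithmetic in the flow code).
NODES = [1, 2, 3, 4, 5]
WEIGHTS = {(1, 2): 2, (1, 3): 1, (2, 3): 3, (2, 4): 1, (3, 5): 2, (4, 5): 4}
TREE = gomory_hu_tree(NODES, WEIGHTS)
```

For this input the sketch yields the chain 1-2-3-4-5 with edge weights 3, 5, 3, 5. The smallest weight on a tree path is meant to equal the min-cut value between its endpoints; for instance, nodes 2 and 4 are separated most cheaply by the cut around {4, 5}, which has weight 3.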

The algorithm is illustrated using the graph in Figure 2(a). In the beginning of the algorithm, all nodes are clustered in a super node, as shown in Figure 4(a).



Fig. 5. A min 1-2 cut and a min 2-3 cut.

Fig. 6. (a) The final topology of the super nodes. (b) All min cuts.

We arbitrarily select two nodes, say 1 and 2 in Figure 5, to be disconnected. Since a min 1-2 cut consists of edges (1, 2) and (1, 6), the super node can be divided into two super nodes (1) and (2, 3, 4, 5, 6, 7) with weight equal to 1.5 (Figure 4(b)). Next, we try to isolate nodes, say, 2 and 3 as shown in Figure 5. A min 2-3 cut consists of edges (2, 3), (6, 3), and (6, 5), so the super node (2, 3, 4, 5, 6, 7) is then divided into super nodes (2, 6, 7) and (3, 4, 5) with the minimum weight, 1.3, assigned (Figure 4(c)). Note that the min 2-3 cut puts nodes 1, 2, 6, and 7 on the same side. Therefore, super node (2, 6, 7) is attached to super node (1) (Figure 4(c)). By repeating the above process, we obtain an edge-weighted tree (Figure 6(a)) whose corresponding min-cuts are shown in Figure 6(b). The maximum flow (or the weight of the minimum cut) for each pair of nodes can be found directly from the tree. For example, the maximum flow between nodes 1 and 3 is 1.3, which can be observed from Figure 6(a). A GH cut-tree is the tree generated by the algorithm in Figure 3.

The preceding example demonstrates that the collection of all cuts is generated by greedily deriving the minimum cut between two nodes. It was shown in Gomory and Hu [1961] that the value of the maximum flow between any two nodes, $u$ and $v$, is the minimum edge weight between $u$ and $v$ in the cut tree.



Fig. 7. (a) A set of four noncrossing cuts. (b) Cuts 1 and 2 cross each other.

By the max-flow/min-cut theorem, since such an edge represents the max flow between $u$ and $v$, it is a min-cut between $u$ and $v$ as well. It can be proved that the collection of cuts derived by the greedy method is minimum over all $u$-$v$ cuts.
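Reading a pairwise max-flow value off a cut tree amounts to finding the smallest edge weight on the tree path between the two nodes. The Python sketch below does this for a hypothetical cut tree; it is not the tree of Figure 6(a), whose full edge list is not reproduced here, and the function name is illustrative.

```python
def min_cut_from_tree(tree_edges, u, v):
    """Smallest edge weight on the unique tree path from u to v."""
    adj = {}
    for a, b, w in tree_edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    # Depth-first walk that carries the smallest weight seen so far.
    stack = [(u, None, float("inf"))]
    while stack:
        node, prev, smallest = stack.pop()
        if node == v:
            return smallest
        for nxt, w in adj.get(node, []):
            if nxt != prev:
                stack.append((nxt, node, min(smallest, w)))
    raise ValueError("u and v are not connected in the tree")


# Hypothetical cut tree given as (node, node, cut weight) triples.
CUT_TREE = [(1, 2, 3), (2, 3, 5), (3, 4, 3), (4, 5, 5)]

flow_1_5 = min_cut_from_tree(CUT_TREE, 1, 5)  # min(3, 5, 3, 5)
flow_2_3 = min_cut_from_tree(CUT_TREE, 2, 3)  # single edge of weight 5
```

Because a tree has exactly one path between any two nodes, excluding the node just visited is enough to avoid revisiting nodes during the walk.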

Let $G = (V, E)$ be a connected graph with positive edge weights. If there exist two cuts, say $\mathrm{cut}_G(A)$ and $\mathrm{cut}_G(B)$, such that each one of the four sets $A \cap B$, $A - B$, $B - A$, $V - A - B$ is nonempty, then $\mathrm{cut}_G(A)$ and $\mathrm{cut}_G(B)$ are called crossing cuts. A set of cuts is noncrossing if no two cuts in this collection cross each other. Figure 7(a) shows a set of noncrossing cuts. Figure 7(b) shows a set of three cuts where two of them cross each other. It was shown in Gomory and Hu [1961] that the collection of min-cuts in the GH cut-tree we found is noncrossing.
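The crossing condition is a direct test on four set differences. A minimal Python sketch, with a hypothetical six-node vertex set and illustrative names, is:

```python
def cuts_cross(side_a, side_b, all_nodes):
    """True when A & B, A - B, B - A and V - A - B are all nonempty."""
    a, b, v = set(side_a), set(side_b), set(all_nodes)
    return all([a & b, a - b, b - a, v - a - b])


def is_noncrossing(sides, all_nodes):
    """True when no two cuts of the collection cross each other."""
    sides = list(sides)
    return not any(
        cuts_cross(sides[i], sides[j], all_nodes)
        for i in range(len(sides))
        for j in range(i + 1, len(sides))
    )


V = {1, 2, 3, 4, 5, 6}  # hypothetical vertex set
crossing = cuts_cross({1, 2}, {2, 3}, V)     # all four sets are nonempty
nested = cuts_cross({1, 2}, {1, 2, 3}, V)    # A - B is empty
disjoint = cuts_cross({1, 2}, {3, 4}, V)     # A & B is empty
```

Nested and disjoint vertex sets never cross, which is the structure a tree of cuts naturally has.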

THEOREM 2.1 (SEE HARTVIGSEN AND MARGOT [1995]). The following statements are equivalent:

- $C$ is a minimal collection of noncrossing cuts in $G$ such that $C$ contains a min $u$-$v$ cut for every two nodes $u$ and $v$ in $G$.

- $C$ is the minimum weight collection of noncrossing cuts in $G$ such that $C$ contains a $u$-$v$ cut for every two nodes $u$ and $v$ in $G$.

Theorem 2.1 implies that the weight collection (by summing all edge weights) of Figure 6(a) is minimum over all topologies.

Before proving the equality between the GH cut-tree and the OLA tree, we mention a couple of interesting observations. First, the total edge weight of the GH cut-tree (e.g., Figure 6(a)) equals the sum of all minimal cuts of its corresponding graph (e.g., Figure 6(b)). This can be verified by counting the number of times each weight has been summed. For example, the weight between nodes 1 and 6 of Figure 6(b) is counted twice in totaling the minimal cuts. Similarly, the weight of (1, 6) has been summed twice in obtaining the weights of the GH cut-tree, that is, once for edge (1, 2) (1.5 = 1 + 0.5) and once for edge (2, 6) (1.6 = 0.4 + 0.7 + 0.5). Second, the distance between any two nodes, $i$ and $j$, in the GH cut-tree equals the number of cuts lying on edge $(i, j)$ of the corresponding graph. Again, for example, there are two cuts on edge (1, 6) of Figure 6(b), and it can be found that the distance between nodes 1 and 6 is 2, as shown in Figure 6(a). Theorem 2.2 below is proved based on these two observations.

Again, let $G = (V, E)$ be a graph, $T = (V, E')$ a spanning tree of $G$, and $(i, j)$ an edge of $T$. For brevity, let $V(P)$ be the vertex set of graph $P$, and $S_i(T)$ and $S_j(T)$ the connected components obtained by removing edge $(i, j)$ from $T$, where $i \in V(S_i(T))$, $j \in V(S_j(T))$. The cut-tree cost, $\mathrm{cost}_{CT}(T, G)$, of a spanning tree $T$ derived from $G$ is the sum of each cut weight in constructing $T$. Clearly, the cut-tree cost is exactly $\sum_{(u,v) \in E'} \mathrm{weight}_G(\mathrm{cut}_G(V(S_u(T))))$. For example, the cut-tree cost for the spanning tree of Figure 6(a) derived from the graph of Figure 6(b) is the sum of all cut weights (i.e., the values of all edges crossed by the dotted lines). In fact, the cut-tree cost can also be derived by summing all weights associated with each edge in Figure 6(a).

THEOREM 2.2. A GH cut-tree of $G$ is a spanning tree of $G$ with minimum linear arrangement cost.

PROOF. Let $T$ be a spanning tree of $G$. By Theorem 2.1, the weight collection of the GH cut-tree is minimum. Therefore it suffices to prove $\mathrm{cost}_{LA}(T, G) = \mathrm{cost}_{CT}(T, G)$ as follows. Let $\mathrm{path}_T(i, j)$ be the edge set of the path connecting nodes $i$ and $j$ in $T$. Let $[c]$ be 1 if condition $c$ holds, and 0 otherwise. Clearly, $\mathrm{cost}_{LA}(T, G)$ is equal to

$$
\begin{aligned}
\sum_{(i,j) \in E} \mathrm{weight}_G(i, j) \cdot \mathrm{dist}(T, i, j)
&= \sum_{(i,j) \in E} \mathrm{weight}_G(i, j) \cdot \sum_{(u,v) \in \mathrm{path}_T(i,j)} 1 \\
&= \sum_{(i,j) \in E} \mathrm{weight}_G(i, j) \cdot \sum_{(u,v) \in E'} [(u, v) \in \mathrm{path}_T(i, j)] \\
&= \sum_{(u,v) \in E'} \sum_{(i,j) \in E} \bigl( \mathrm{weight}_G(i, j) \cdot [(u, v) \in \mathrm{path}_T(i, j)] \bigr).
\end{aligned}
$$

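Now, for a fixed tree edge $(u, v)$, each edge $(i, j)$ of $E$ can be involved in the computation of $\mathrm{cost}_{LA}(T, G)$ in the following two cases: (1) nodes $i$ and $j$ are in different components obtained by removing $(u, v)$ from $T$, that is, $i \in V(S_u(T))$, $j \in V(S_v(T))$ or $j \in V(S_u(T))$, $i \in V(S_v(T))$; (2) nodes $i$ and $j$ are in the same component obtained by removing $(u, v)$ from $T$, that is, $i \in V(S_u(T))$, $j \in V(S_u(T))$ or $j \in V(S_v(T))$, $i \in V(S_v(T))$. Clearly, $[(u, v) \in \mathrm{path}_T(i, j)] = 0$ for case 2, since the path between $i$ and $j$ stays inside one component and never uses $(u, v)$. For case 1 the bracket equals 1, because the unique path between $i$ and $j$ in $T$ must cross $(u, v)$. Therefore the theorem is proved by

$$
\begin{aligned}
\mathrm{cost}_{LA}(T, G)
&= \sum_{(u,v) \in E'} \; \sum_{(i,j) \in E;\; i \in V(S_u(T));\; j \in V(S_v(T))} \mathrm{weight}_G(i, j) \\
&= \sum_{(u,v) \in E'} \mathrm{weight}_G(\mathrm{cut}_G(V(S_u(T)))) \\
&= \mathrm{cost}_{CT}(T, G).
\end{aligned}
$$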
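The identity between the two costs can be checked numerically on small inputs. The Python sketch below computes both sides independently, one from tree distances and one from the cuts induced by removing each tree edge. The graph, the two trees, and the function names are hypothetical.

```python
from collections import deque


def _adjacency(tree_edges, skip=None):
    """Adjacency lists of the tree, optionally with one edge removed."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, [])
        adj.setdefault(v, [])
        if (u, v) != skip:
            adj[u].append(v)
            adj[v].append(u)
    return adj


def _bfs(adj, source):
    """Edge counts from source to every node reachable in adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def cost_la(tree_edges, graph_weights):
    """Sum of weight(i, j) * dist(T, i, j) over all graph edges."""
    adj = _adjacency(tree_edges)
    return sum(w * _bfs(adj, i)[j] for (i, j), w in graph_weights.items())


def cost_ct(tree_edges, graph_weights):
    """Sum over tree edges (u, v) of the weight of cut_G(V(S_u(T)))."""
    total = 0.0
    for u, v in tree_edges:
        side_u = set(_bfs(_adjacency(tree_edges, skip=(u, v)), u))
        total += sum(w for (i, j), w in graph_weights.items()
                     if (i in side_u) != (j in side_u))
    return total


# Hypothetical graph and two spanning trees of it.
GRAPH = {(1, 2): 3.0, (2, 3): 1.0, (3, 4): 2.0, (1, 4): 0.5}
TREE_A = [(1, 2), (2, 3), (3, 4)]
TREE_B = [(1, 2), (1, 4), (3, 4)]
la_a, ct_a = cost_la(TREE_A, GRAPH), cost_ct(TREE_A, GRAPH)
la_b, ct_b = cost_la(TREE_B, GRAPH), cost_ct(TREE_B, GRAPH)
```

For `TREE_A` the three induced cuts weigh 3.5, 1.5, and 2.5, which sum to the same 7.5 obtained from the distance-based definition. A check of this kind does not replace the proof, but it is a convenient sanity test for an implementation.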

To further lower the power cost, we give a heuristic tree-clustering algorithm to incrementally (and locally) improve the switched capacitance [Chen et al. 1999]. The algorithm arbitrarily selects an edge $e$ in the bus tree already derived, and tries to merge the corresponding two end-nodes (devices) of $e$ into a generalized node (the same bus segment). If this merging results in a smaller power cost, then these two nodes are merged into the same cluster. Otherwise, the algorithm gives up the merging process for $e$. This process repeats until merging two generalized nodes for any edge no longer reduces the switched capacitance. Note that the power cost is derived by enumerating the (weighted) switched capacitance for each pair of data communication using the formula $k_1(n - 1) + k_2$ [Chen et al. 1999].

Fig. 8. The architecture of 80C51.

3. CHIP DESIGN AND IMPLEMENTATION

After the above theoretical analysis, implementation issues will be discussed in this section to demonstrate the feasibility and efficiency of utilizing the bus segmentation technique. A popular μ-controller, Intel's 80C51 [Mackenzie 1992], will be used as the test vehicle, and its architecture (without implementing bus segmentation) is shown in Figure 8. Several building blocks, such as the arithmetic and logic unit (ALU), static random access memory (SRAM), read-only memory (ROM), Timer/Counter, and input/output (I/O) ports, are found in this design, and they are all connected through a common bus. In this design, the width of the bus is only 8 bits and some components such as the interrupt controller are not implemented.

3.1 Design Flow and Design Environment

We have modified a typical application-specific integrated circuit (ASIC) design flow to include the steps for implementing the bus segmentation technique. The complete design flow is shown in Figure 9, where the dotted lines represent the design flow dedicated to the bus segmentation technique. Several extra steps are required in the register-transfer level design and physical level design. First, after the RTL design has been finished for the μ-controller, the circuit components and buses can be determined. The application program must be simulated on the RTL description. The communication frequencies among the components can then be derived, and the bus graph can be generated as shown in Figure 10(a). Using the OLA tree analysis method discussed before, we can transform the bus graph into a bus tree (Figure 10(b)) where highly communicating components can exchange data by traversing the minimum number of bus segments. Based on the bus tree, the location of each bus segmentation cell (BSC) can be determined. Note that each BSC physically implements the pass transistor for segment control and the tree-clustering process is not implemented. In the real implementation, each BSC is designed using an n-channel metal-oxide semiconductor (NMOS) transistor. Finally, the controller must be redesigned such that proper control signals can be issued to control the BSCs to accomplish the required data communication for each instruction. Thus, the RTL for the controller and bus communication must be redesigned as shown in Figure 9(a).

Fig. 9. The flow of complete design.

Fig. 10. (a) The bus graph. (b) The bus tree.
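The step from RTL simulation to a bus graph can be sketched as a simple counting pass. The Python example below assumes that the simulation yields a list of (sender, receiver) bus transfers; this trace format, the component names in the sample, and the normalization to relative frequencies are assumptions made for illustration and are not taken from the actual tool flow.

```python
from collections import Counter


def build_bus_graph(transfers):
    """Turn a list of (sender, receiver) transfers into an undirected
    edge-weighted graph whose weights are relative frequencies."""
    counts = Counter(tuple(sorted(pair)) for pair in transfers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical trace; a real one would come from simulating the
# application program on the RTL description.
TRACE = [("ALU", "RAM"), ("RAM", "ALU"), ("ALU", "ROM"), ("ALU", "RAM")]
bus_graph = build_bus_graph(TRACE)
```

Sorting each pair merges the two transfer directions into one undirected edge, which matches the undirected bus graph used in Section 2.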



The modified RTL description can then be synthesized into a gate-level circuit, and physical level design can be started. As shown in Figure 9(b), the BSC layout must be designed and included in the cell library with the sizing issue considered. The floorplan design also must consider the effect of the BSCs. Then, the BSCs will be placed and routed. Meanwhile, we know that the segmented bus architecture is strongly dependent on the particular application program (AP). Two application programs have been studied in this research. The first program (AP1) was created to do the memory test, while the second one (AP2) was designed for a keyboard controller. The bus graph and the bus tree for implementing AP2 are shown in Figure 10.

The corresponding electronic design automation (EDA) tools used in each design step are shown in Figure 9. Verilog HDL is used to describe the RTL design, and the Design Compiler provided by Synopsys [Synopsys, Inc. 1994] is used to synthesize the circuits except the 128 × 8 SRAM and the 2K × 8 program ROM. The Compass cell library [Compass Design Automation, Inc. no date] is used for the design, which is technology-mapped to the TSMC 0.6-μm single-poly double-metal (SPDM) CMOS process [TSMC no date]. Finally, PowerMill [Epic Design Technology, Inc. no date] is used to characterize the power consumption of the post-layout design.

3.2 Controller Design

The instructions of our μ-controller consume one, two, or four machine cycles; each machine cycle has six stages, and each stage has two phases. Each instruction performs data exchange between components through the bus (if necessary) under the control of machine cycles, stages, and phases. Thus, each bus segment might be turned on or off by a BSC signal for certain instructions, machine cycles, stages, and phases. In our example, an application program (AP) is stored in the embedded ROM of the μ-controller. The instruction set of a μ-controller is large, but not all of the instructions are used in this AP. Thus, we have developed a software tool to analyze the AP and identify a set of never-used and rarely used instructions.
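The analysis performed by such a tool can be approximated by counting how often each instruction occurs in the program. The Python sketch below is not the authors' tool: the input format (a flat list of mnemonics), the sample instruction set, and the function name are hypothetical, and the 1% threshold for rarely used instructions follows the utilization rate quoted in Section 4.

```python
from collections import Counter


def classify_instructions(instruction_set, program, rare_threshold=0.01):
    """Return (never_used, rarely_used) instruction lists for a program
    given as a list of mnemonics (static occurrence counts)."""
    counts = Counter(program)
    total = len(program)
    never_used = sorted(op for op in instruction_set if counts[op] == 0)
    rarely_used = sorted(
        op for op in instruction_set
        if counts[op] > 0 and counts[op] / total < rare_threshold
    )
    return never_used, rarely_used


# Hypothetical instruction set and program listing.
INSTRUCTION_SET = ["MOV", "ADD", "LJMP", "MUL", "DIV"]
PROGRAM = ["MOV"] * 150 + ["ADD"] * 49 + ["LJMP"] * 1
never_used, rarely_used = classify_instructions(INSTRUCTION_SET, PROGRAM)
```

Here `LJMP` accounts for 0.5% of the listing and is classified as rarely used, while `MUL` and `DIV` never occur. The two lists correspond to the inputs of the two controller optimizations discussed in the rest of this subsection.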

The first approach for the controller optimization is to simplify the circuit that generates the BSC signals. This can be done by removing the stage and phase signals from the BSC control circuit for instructions which are rarely used. We emphasize that the change in the control circuit will not affect the normal function of data transfers among the functional blocks. The basic idea behind the first approach is that synthesizing the BSC control circuit without considering the stage and phase information will reduce its size, and thus the power consumption of the control unit will be decreased. However, synthesizing the BSC control circuit without considering the stage and phase information will unnecessarily turn on some bus segments and thus consume more bus power. That is the reason why only rarely used instructions are allowed to undergo this optimization. It should be noted that, using the first approach, the increase in bus power consumption will rarely occur (since the corresponding instructions are rarely used), but the saving for the BSC control circuit is valid for every instruction. Thus, it is beneficial in most cases.



Fig. 11. The guideline for BSC placement.

The second approach pursues a more ambitious optimization process by removing all instructions which are not used in the AP. Since most APs do not use all instructions, taking unused instructions out of the control unit results in a significant reduction of the circuit and thus of the power consumption. For example, only about 37% of the instructions are used in the AP of our case. Unlike the first approach, the second approach does not introduce the side-effect of consuming more bus power since the instructions used in the AP are not affected at all.

3.3 Placement and Routing Guidelines

This subsection discusses several guidelines for physical design. (1) A better floorplan produces a smaller chip area and results in better chip performance. In particular, when the bus segmentation technique is implemented, the placement of the BSCs affects not only the whole chip area but also the interconnection length of each bus segment. The interconnection length of each bus segment will in turn affect the parasitic capacitance and the power consumption. (2) Blocks that communicate very often must be placed close together in order to reduce their interconnection length. If the switching activity among the blocks is quite frequent, it is not even necessary to insert BSCs among them (by tree clustering). Furthermore, the pins of each block that are connected to the BSCs must also be arranged to be as near the BSCs as possible. (3) When a BSC must be inserted between two blocks, its best position must also be determined. If the total switching activity of all the buses connected to B2 is more than that of all the buses connected to B1, as shown in Figure 11, the BSC between B1 and B2 must be adjusted to be as close to B2 as possible.
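Guideline (3) can be motivated by a simple first-order model. The following is only a sketch under assumptions that are not stated in the text: the wire between B1 and B2 has length $L$ and a uniform capacitance $c$ per unit length, the BSC sits at distance $x$ from B1, and the wire piece on each side of the BSC switches with the total activity of the buses attached to that block (hypothetical symbols $a_1$ for B1 and $a_2$ for B2).

```latex
\begin{aligned}
C_{sw}(x) &= c\,\bigl(a_1\,x + a_2\,(L - x)\bigr), \qquad 0 \le x \le L, \\
\frac{dC_{sw}}{dx} &= c\,(a_1 - a_2).
\end{aligned}
```

If $a_2 > a_1$, the derivative is negative, so under this model the switched capacitance decreases as $x$ grows and is smallest at $x = L$, that is, with the BSC placed next to B2. This agrees with the guideline; a real layout would also have to account for pin positions and routing constraints.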

4. EXPERIMENTAL RESULTS

Different versions of the μ-controller chip are simulated using EPIC PowerMill after the design rule check (DRC) and electrical rule check (ERC) processes have been finished. Table I lists the features of each version. The features include the embedded program tested (the RAM test program or our application program), bus segmentation, stage removal for BSC signals (if the utilization rate is smaller than 1% using static analysis), and unused instruction removal. These features are represented by different symbols as shown in Table I. For example, the μ-controller of version 1SFU is designed using full-custom BSCs (represented by “F”) with the RAM test program (represented by “1”) executed by PowerMill. Moreover, the bus is segmented (represented by “S”) and all control circuits for unused instructions are removed from the controller (represented by “U”). Both application programs are stored in the ROM of the μ-controller. However, for the AP1 test (versions 1, 1SF, 1SFB, and 1SFU), only the memory testing code is executed; for the AP2 test (versions 2, 2SF, 2SFB, and 2SFU), only the keyboard controller code is executed. Both are achieved by using a long jump instruction. Design versions 1 and 2 are implemented without bus segmentation, while in the others the bus segmentation technique is applied. When implementing the bus segmentation technique for design versions 1SF and 2SF, we consider only the communication frequencies between the blocks. For versions 1SFB and 2SFB, the stage information is further considered when generating the BSC control signals. For versions 1SFU and 2SFU, all unused instructions are removed from the controller design.

Table I. The Features of All Versions (the symbols used in the version names have the following meanings: “1” stands for AP1; “2” stands for AP2; “S” stands for segmented bus design; “F” stands for full-custom design; “B” stands for removing stage for BSC signals; “U” stands for removing unused instructions for controller)

Feature \ μ-controller version               1    1SF  1SFB  1SFU  2    2SF  2SFB  2SFU
BSC with full-custom design                       v    v     v          v    v     v
RAM test code (AP1)                          v    v    v     v
Application code (AP2)                                             v    v    v     v
Segmented bus design                              v    v     v          v    v     v
Remove stage for BSC signals                           v                     v
Remove unused instructions for controller                    v                     v

Here, we only show the chip layout for version 1SF in Figure 12, although all different versions have been implemented and fully simulated. Figure 13(a) shows the bus segments IB10, IB11, and IB12 of version 1SF. The interconnection structures are further detailed in Figure 13(b). It can be seen from Figure 13(a) that the long wires have been dramatically shortened by the corresponding BSCs. For example, when the ALU is communicating with the RAM, we turn off BSC0–BSC3 and BSC5 to reduce the capacitance loading. Table II shows the power consumption for all versions using different execution clock times (rather than CPU times). Each version first executes for 10 ms at a clock rate of 25 MHz. The chip is then reset to the initial state and executes for 15 ms, and the process is repeated until a run of 100 ms has been executed. The power consumption is estimated by averaging the current flow, measured in mA, using PowerMill.

Fig. 12. Version 1SF chip layout of the µ-controller.

According to Table II, the power consumption of version 1SF in the RAM test mode is about 24.6% lower than that of the original version 1. Thus, the power consumption can be significantly reduced if the bus is segmented by full-custom BSCs. If the stage controls for BSC signals are removed for rarely utilized instructions, the reduction relative to version 1 increases to about 25.55% (by comparing versions 1 and 1SFB). More ambitiously, if the control circuits for all unused instructions are removed from the controller, the power consumption can be dramatically reduced by about 37.21% (by comparing versions 1 and 1SFU). The application program of a keyboard controller is simulated in versions 2, 2SF, 2SFB, and 2SFU. Again, comparing version 2 with versions 2SF, 2SFB, and 2SFU shows that a significant amount of power can be saved using the bus segmentation technique. The task of testing RAM (AP1) is very simple in that only a small number of components are involved. Thus, only a small number of bus segments need to be used (as shown in Figure 13(b)), and this provides a very good opportunity to test the performance extreme of the proposed method. From Table II, it can be observed that the power saved in RAM testing is much more significant than that in the case of the keyboard controller.
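The percentages quoted above can be checked against Table II. The following is a minimal sketch, assuming that the quoted reductions are taken from the 100 ms column (this assumption is ours; the text does not name the column). The dictionary `CURRENT_MA_100MS` and the helper `reduction_percent` are hypothetical names introduced for illustration; the values are copied from Table II.

```python
# Average current in mA at 100 ms of simulated time, copied from Table II.
# Assumption: the percentages in the text are based on this column.
CURRENT_MA_100MS = {
    "1": 20.384, "1SF": 15.360, "1SFB": 15.175, "1SFU": 12.799,
    "2": 20.822, "2SF": 17.711, "2SFB": 17.494, "2SFU": 15.899,
}


def reduction_percent(baseline: str, version: str) -> float:
    """Relative saving of `version` with respect to `baseline`, in percent."""
    base = CURRENT_MA_100MS[baseline]
    return 100.0 * (base - CURRENT_MA_100MS[version]) / base


for baseline, versions in (("1", ("1SF", "1SFB", "1SFU")),
                           ("2", ("2SF", "2SFB", "2SFU"))):
    for version in versions:
        saving = reduction_percent(baseline, version)
        print(f"{version} vs {baseline}: {saving:.2f}% lower")
```

Under this assumption the computed savings for versions 1SFB and 1SFU agree with the 25.55% and 37.21% quoted in the text, and the savings for the keyboard controller versions come out smaller than those for the RAM test, consistent with the observation above.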

Fig. 13. (a) The locations of IB10–IB12. (b) The floorplan with IB10–IB12 routed.

Table II. Experimental Results for Power Consumption (each row corresponds to a version of the µ-controller, each column corresponds to an amount of simulation time in ms, and each cell is the amount of consumed power in mA; operating frequency is 25 MHz)

Ver.\time  10      15      20      25      30      35      40      45      50      100
1          20.018  20.156  20.209  20.287  20.286  20.217  20.272  20.250  20.225  20.384
1SF        16.126  15.843  15.624  15.583  15.501  15.396  15.405  15.367  15.339  15.360
1SFB       16.031  15.706  15.524  15.461  15.363  15.256  15.273  15.220  15.195  15.175
1SFU       13.556  13.048  13.198  12.905  13.007  12.884  12.894  12.854  12.832  12.799
2          20.398  20.737  20.904  20.423  20.683  20.776  20.566  20.552  20.670  20.822
2SF        18.210  18.011  18.073  17.694  17.751  17.763  17.630  17.585  17.657  17.711
2SFB       18.097  17.844  17.888  17.503  17.603  17.592  17.447  17.399  17.474  17.494
2SFU       15.398  15.570  15.826  15.603  15.746  15.781  15.720  15.702  15.787  15.899

Table III. Chip Sizes of All Versions

Version  Width (µm)  Height (µm)  Area (µm²)  Area ratio
1        3843.025    3745.700     14394818    1
1SF      4083.575    3730.900     15235409    1.0583953
1SFB     3944.150    3752.300     14799634    1.0281223
1SFU     3711.075    3550.625     12991081    0.9024831
2        3843.025    3745.700     14394818    1
2SF      4083.575    3730.900     15235409    1.0583953
2SFB     3944.150    3752.300     14799634    1.0281223
2SFU     3738.125    3649.150     13640978    0.9476312

Table III gives the chip sizes for the different versions of the µ-controller. Each area includes the space occupied by I/O pads. About 6% of area overhead is introduced when the bus segmentation technique is used (by comparing versions 1 and 1SF or versions 2 and 2SF). However, the hardware overhead can be completely compensated for by removing circuits from the µ-controller. For example, when the control signals for all unused instructions are removed from the µ-controller, the area is reduced by about 9.75% when versions 1 and 1SFU are compared, and by about 5.24% when versions 2 and 2SFU are compared.
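The area figures can be reproduced from the Area column of Table III. The following is a minimal sketch; the dictionary `AREA_UM2` and the helpers `area_ratio` and `area_change_percent` are hypothetical names introduced here, and the areas are copied from the table as printed rather than recomputed from width and height.

```python
# Chip areas in square micrometres, copied from the Area column of Table III.
AREA_UM2 = {
    "1": 14394818, "1SF": 15235409, "1SFB": 14799634, "1SFU": 12991081,
    "2": 14394818, "2SF": 15235409, "2SFB": 14799634, "2SFU": 13640978,
}


def area_ratio(baseline: str, version: str) -> float:
    """Area of `version` divided by the area of `baseline`."""
    return AREA_UM2[version] / AREA_UM2[baseline]


def area_change_percent(baseline: str, version: str) -> float:
    """Signed area change in percent; positive means overhead."""
    return 100.0 * (area_ratio(baseline, version) - 1.0)


for baseline, versions in (("1", ("1SF", "1SFB", "1SFU")),
                           ("2", ("2SF", "2SFB", "2SFU"))):
    for version in versions:
        change = area_change_percent(baseline, version)
        print(f"{version} vs {baseline}: ratio {area_ratio(baseline, version):.7f}, "
              f"change {change:+.2f}%")
```

A positive change corresponds to the overhead introduced by the BSCs, and a negative change to the net saving once the unused control circuits are removed.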

5. CONCLUSION

We presented the design theory and implementation issues of a bus segmentation method for lowering the power dissipation on system buses. If the highly communicating devices are clustered and separated from the rest of the devices, the bus capacitance that must be charged and discharged becomes smaller and the power dissipation is reduced. Computer simulation shows that the bus delay can also be improved [Chen et al. 1999]. The bus segmentation method can be used in ASIC design where the application program is fixed, so the communication frequency between each pair of devices can be well estimated by computer simulation. We have applied the method to a low-power µ-controller where the bus is partitioned into several bus segments. Power analysis based on the postlayout simulation demonstrates that a significant amount of power can be saved. We expect to save more power if the bus contains more data bits. Mehra et al. [1997] proposed a method that partitions a given algorithm into spatially local clusters so that the majority of data transfers take place within clusters. Thus, our method can be regarded as an extension of Mehra et al. [1997], with the objective of further reducing the power consumption for intercluster communications. This paper concentrates only on dynamic bus trees, and it will be interesting to extend the results to general bus structures. Additionally, guidelines for tree clustering should be investigated.
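The link between switched capacitance and power can be made explicit with the standard first-order model of CMOS dynamic power. The expression below is a sketch under stated assumptions: the symbols $\alpha$ (switching activity), $C_{\text{sw}}$ (switched capacitance), $C_{\text{seg}}$ (capacitance of the active segments), and $C_{\text{bus}}$ (capacitance of the unsegmented bus) are introduced here for illustration, and the ratio covers only the bus wiring, not the overhead of the BSCs or their control logic.

```latex
P_{\text{dyn}} = \alpha \, C_{\text{sw}} \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\text{seg}}}{P_{\text{bus}}} = \frac{C_{\text{seg}}}{C_{\text{bus}}}
\quad \text{for equal } \alpha,\ V_{dd},\ f .
```

In this model, a transfer that activates only a few segments scales the bus contribution to the dynamic power by the fraction of the bus capacitance actually switched.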

ACKNOWLEDGMENT

We thank R. Ravi for bringing Hartvigsen and Margot [1995] to our attention and for showing us its relationship with the OLA tree.

REFERENCES

BELLAOUAR, A., ABU-KHATER, I. S., AND ELMASRY, M. I. 1995. An ultra-low-power CMOS on-chip interconnect architecture. In Symposium on Low Power Electronics. Digest of Technical Papers (Oct. 1995). 52–53.

CARDARILLI, G. C., SALMERI, M., SALSANO, A., AND SIMONELLI, O. 1996. Bus architecture for low-power VLSI digital circuit. In International Symposium on Circuits and Systems, vol. 4 (May 1996). 21–24.

CAUFAPE, S. AND FIGUERAS, J. 1996. Power optimization for delay constrained CMOS bus drivers. In European Design and Test Conference (Mar. 1996). 205–211.

CHANDRAKASAN, A. P. AND BRODERSEN, R. W. 1995. Low Power Digital CMOS Design. Kluwer Academic Publishers, Boston, MA.

CHANDRAKASAN, A. P., SHENG, S., AND BRODERSEN, R. W. 1992. Low-power CMOS digital design. IEEE J. Solid-State Circ. 27, 4 (Apr.), 472–484.

CHEN, J. Y., JONE, W. B., WANG, J. S., LU, H.-I., AND CHEN, T. F. 1999. Segmented bus design for low-power system. IEEE Trans. VLSI Syst. 7, 1 (Mar.), 25–29.

COMPASS DESIGN AUTOMATION, INC. no date. 0.6-µm Passport Preliminary Design Kit. Compass Design Automation, Inc.

EPIC DESIGN TECHNOLOGY, INC. PowerMill User Manual. Epic Design Technology, Inc., Mountain View, CA.

GOLSHAN, R. AND HAROUN, B. 1994. A novel reduced swing CMOS bus interface circuit for high speed low power VLSI systems. In International Symposium on Circuits and Systems, vol. 4 (May 1994). 351–354.

GOMORY, R. E. AND HU, T. C. 1961. Multi-terminal network flows. J. SIAM 9, 4 (Dec.), 551–569.

HARTVIGSEN, D. AND MARGOT, F. 1995. Multiterminal flows and cuts. Operat. Res. Lett. 17, 5, 201–204.

HIRAKI, M., KOJIMA, H., MISAWA, H., AKAZAWA, T., AND HATANO, Y. 1995. Data-dependent logic swing internal bus architecture for ultralow-power LSI’s. IEEE J. Solid-State Circ. 30, 4 (Apr.), 397–402.

IKEDA, M. AND ASADA, K. 1994. A reduced-swing data transmission scheme for resistive bus lines in VLSIs. In European Design Automation Conference (Feb. 1994). 546–550.

LIU, D. AND SVENSSON, C. 1994. Power consumption estimation in CMOS VLSI chips. IEEE J. Solid-State Circ. 29, 6 (June), 663–670.

MACKENZIE, S. 1992. The 8051 Microcontroller. Macmillan Publishing Company, London, U.K.

MEHRA, R., GUERRA, L. M., AND RABAEY, J. M. 1997. A partitioning scheme for optimizing interconnect power. IEEE J. Solid-State Circ. 32, 3 (Mar.), 433–443.

NAKAGOME, Y., ITOH, K., ISODA, M., TAKEUCHI, K., AND AOKI, M. 1993. Sub-1-V swing internal bus architecture for future low-power ULSI’s. IEEE J. Solid-State Circ. 28, 4 (Apr.), 414–419.

STAN, M. R. AND BURLESON, W. P. 1995a. Bus-invert coding for low-power I/O. IEEE Trans. VLSI Syst. 3, 1 (Mar.), 49–58.

STAN, M. R. AND BURLESON, W. P. 1995b. Coding a terminated bus for low power. In Fifth Great Lakes Symposium on VLSI (Mar. 1995). 70–73.

SUNDSTORM, R., PHAM, P., ESGAR, D., AND PETTY, C. 1990. A low power differential bus utilizing novel level bus technique. In Bipolar Circuits and Technology Meeting (Sept. 1990). 144–147.

SYNOPSYS, INC. no date. HDL Compiler for Verilog Reference Manual. Synopsys, Inc., Mountain View, CA.

TSMC. no date. 0.6-µm Logic CMOS Process (0.6U-TW-14L-DS). TSMC, Hsin-Chu, Taiwan, R.O.C.

Received August 1999; accepted January 2002

