Using constraint programming for the design of network-on-chip architectures

Computing, DOI 10.1007/s00607-013-0359-4


Ayhan Demiriz · Nader Bagherzadeh · Abdulaziz Alhussein

Received: 15 April 2013 / Accepted: 30 September 2013 © Springer-Verlag Wien 2013

Abstract NoC technology is composed of packet-based interconnections, where the communication resources are distributed across the network. Therefore, optimal resource utilization is a crucial consideration for efficient architectural designs. This paper studies the practicality of Constraint Programming (CP) models for NoC architecture designs that use a regular mesh with wormhole switching and XY routing. The complexity of the CP models is compared with that of the earlier Mixed Integer Programming (MIP) models. Practical CP-based mapping and scheduling models are developed and results are reported on benchmark datasets. Results indicate that mapping and scheduling problems can be solved to near optimality even under shorter run-time limits than those required by the MIP models.

Keywords Constraint programming · Application mapping · Scheduling · Network-on-chip

Mathematics Subject Classification 90C27 · 47N10 · 68M20 · 47N70

A. Demiriz (B)
Sakarya University, Sakarya 54187, Turkey
e-mail: [email protected]

N. Bagherzadeh
Center for Pervasive Communications and Computing (CPCC), University of California Irvine, Irvine, CA 92697, USA
e-mail: [email protected]

A. Alhussein
King Abdulaziz City for Science and Technology, Riyadh 92697, Saudi Arabia
e-mail: [email protected]


A. Demiriz et al.

1 Introduction

High-performance system-on-chip (SoC) and chip multi-processor designs entail an increase in the number of required processor elements (PEs). However, the communication performance between these PEs is degraded by bus-system limitations. Shared buses become bottlenecks for system communications as the number of PEs increases. Network-on-chip (NoC) is a technology developed to overcome the limitations of the classical bus and provide a scalable design for hundreds of cores [3]. NoC technology is based on packet-switched interconnections, where the communication resources are distributed. PEs communicate and exchange data via packets using these shared resources (links, buffers, switches, etc.). Each PE is attached to a router that connects the PE to the network. Packets are transferred among routers according to routing algorithms that define the packet flow and directions. PEs are placed and organized in a topology with certain characteristics such as wire length, average number of communication hops, and port degree. Network topologies include mesh, torus, star, and others.

While routing algorithms and network topologies influence the performance of NoCs, mapping applications to the NoC architecture is the key to the overall performance. Mapping tasks to the PEs and optimizing the NoC traffic among them determine the actual performance metrics in terms of speed-up, power consumption, and the degree of parallelization.

Mapping assigns and places the IP cores onto PEs such that an objective function (metric of interest) is minimized, where a particular topology and the corresponding task assignments are given beforehand [13]. Mapping can be considered a sub-problem of floor-planning, where finding the optimal topology is also considered part of the problem. However, floor-planning and mapping are practically equivalent in a regular mesh topology with identical PEs. In the rest of the paper, floor-planning and mapping are used interchangeably. Mapping is related to the quadratic assignment problem (QAP) [11,19], which can be posed as a mixed-integer programming (MIP) model and was shown to be NP-hard [9]. The QAP can generally be defined as assigning facilities to locations in order to minimize the total cost of flowing goods among these locations.

This paper extends our earlier work in [4] and explores the applicability of constraint programming (CP) models in finding solutions to the mapping and application scheduling problems. CP plays an important modeling role at the intersection of traditional mathematical programming and artificial intelligence (AI) [14]. The ability to use intuitive and powerful optimization models that borrow modeling ideas from mathematical programming and powerful search strategies from AI makes the CP paradigm a good candidate for designing efficient NoC architectures. Instead of the complicated MIP models used earlier in NoC synthesis [9,7], which do not necessarily return a solution within the run-time limits, we propose intuitive and efficient CP models that can achieve optimal or near-optimal solutions within relatively short run-time limits.

Contributions of this paper are the following:

– Application mapping and scheduling problems are successfully formulated and modeled by the CP approach.


Using CP for designing NoC

– Wormhole switching is incorporated into the CP models to find the optimal application scheduling solutions based on deterministic XY routing.

– Fifteen different benchmark datasets are generated by varying the number of tasks, the number of links, and the communication volume to study the behavior of the CP models under random realizations of the task assignments.

– Our approach has been tested on the real application benchmark datasets provided in [10], and it has also been compared to the MIP model for solving the mapping problem.

– Packet latency is studied under deterministic conditions to observe its behavior as the task complexities change.

The remainder of this paper is organized as follows. Section 2 introduces the preliminaries of the NoC platform and the related literature on the mapping problems. In Sect. 3, the problem formulation and the CP-based model are given to present the advantages of our approach. Section 4 contains information on the experimental setup, further details of the CP models, and discussions of the results. Section 5 is the main extension in this paper over our work in [4]. In this section, we give results from real application benchmark datasets and show in practice that our approach is scalable for the real application data on an 8 × 8 architecture. The paper is concluded and future directions are highlighted in Sect. 6.

2 Preliminaries and previous work

2.1 Network-on-chip platform

PEs are organized into a 2D mesh network where each PE’s router has four output ports (West, East, North and South) and one internal port connected to the PE. Deterministic routing is adopted; in particular, dimension order routing (DOR) is used. Packets are divided into multiple flits that are 64 bits wide: one header flit and a maximum of 64 body flits. Section II.A of [4] contains more details about the NoC platform.

The header flits require extra cycles to be transmitted. One clock cycle is consumed by the header processing unit (HPU) of the router to process the destination address. Another clock cycle is consumed by the arbiter to arbitrate between different inputs. Moreover, flit transmission to the next router takes one cycle, so a header flit needs three clock cycles per hop. Body flits require only one clock cycle to be transmitted to the next router.
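As a rough illustration of these cycle counts, the following Python sketch tallies the cycles a packet needs. The constant and function names are hypothetical, and the contention-free pipelined latency formula is a simplifying assumption of this sketch, not the scheduling model used in the paper.

```python
# Per-hop cycle counts taken from the text: a header flit needs one cycle each
# for the HPU, the arbiter, and the link traversal; a body flit needs one cycle.
HEADER_CYCLES_PER_HOP = 3
BODY_CYCLES_PER_HOP = 1


def link_cycles(body_flits: int) -> int:
    """Cycles one packet (1 header flit + body flits) spends crossing one hop."""
    return HEADER_CYCLES_PER_HOP + body_flits * BODY_CYCLES_PER_HOP


def uncontended_latency(hops: int, body_flits: int) -> int:
    """Assumed wormhole pipeline without contention: the header pays three
    cycles at every hop and the body flits stream behind it, one per cycle."""
    return hops * HEADER_CYCLES_PER_HOP + body_flits * BODY_CYCLES_PER_HOP


if __name__ == "__main__":
    print(link_cycles(64))             # 67
    print(uncontended_latency(4, 64))  # 76
```

Under these assumptions, a full 64-flit packet occupies a link for 67 cycles, which is why link contention matters for the schedule.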

2.2 Power model

Power consumption in semiconductor technologies consists of static and dynamic power. Static power consumption becomes a critical factor in the total power as the transistors become smaller and faster. The leakage current is a significant factor when the technology is scaled down. On the other hand, dynamic power consumption is produced by the switching activities in the semiconductor circuits.

The energy consumption analysis is presented in Section II.B of [4]. It has been shown that header flits consume more power because of the header processing and arbitration circuits. $E_{Hflit}$ and $E_{Bflit}$ represent the energy consumption of header and data (body) flits, respectively, and can be expressed as follows:


$$E_{Hflit} = E_{buffer} + E_{crossbar} + E_{arbiter} \tag{1}$$

$$E_{Bflit} = E_{buffer} + E_{crossbar} \tag{2}$$

Dynamic power consumption can be reduced when the energy of the packet is minimized. This can be achieved by minimizing the communication energy of packets between PEs. The mapping of tasks plays an essential role in determining the paths that packets travel, and a good mapping reduces the number of hops. On the other hand, task scheduling affects the buffering requirements of packets along the travel path.

2.3 Previous work

A branch-and-bound algorithm is proposed in [9] to find near-optimal mapping solutions that are energy aware and satisfy the bandwidth constraints under some performance constraints. The mapping algorithm proposed in [9] optimizes the energy requirements based on the random assignment of tasks to the IP cores. The mapping problem is thus to place IP cores in appropriate locations on the regular grid in order to minimize energy based on the network traffic [9]. The authors of [18] propose a comprehensive two-stage NoC synthesis model that uses MIP. In the first stage, an energy-efficient system-level floor-planning is achieved through MIP. The second stage handles detailed routing: the placement of routers is optimized to enable the traffic flow. The MIP model in [18] is very complicated and often does not return a solution within the run-time limits, so a clustering-based heuristic is proposed to address the complexity of the second stage. It should be noted that if an appropriate level of problem abstraction is not applied in the MIP models, it is very likely that they will not be able to return a solution within the run-time limits due to complexity issues. From the earlier work, it is evident that MIP implementations may suffer gravely from two shortcomings: misjudging the level of problem abstraction and the difficulty of accurate modeling.

The CP paradigm is used in [2,16] on scheduling problems of multi-task applications on multi-processor system-on-chip (MPSoC) after MIP models are used to find allocation solutions. Thus, mapping and scheduling problems are effectively decomposed [2,16] into two sub-problems and solved in tandem, as in our implementation. The computational efficiency of the decomposition method was shown in [16] in comparison to the full optimization models. Floor-planning and application scheduling tasks are therefore conducted in two different stages as a result of the decomposition. However, in contrast with [16], we use the CP approach for both mapping and scheduling problems rather than using MIP for the mapping stage and CP for the scheduling stage. Another difference in our implementation is that data communication is modeled at the packet level, so scheduling is conducted in much finer detail than in [16]. On the other hand, [16] runs mapping and scheduling successively in an iterative manner to find a converged solution. Reference [12], a relevant survey of CP and MIP implementations of general allocation and scheduling problems, shows how practical it is to solve both problems in two stages.



3 Motivation and problem formulation

3.1 Basic assumptions and overview

Assume a set of PEs, organized as a 2-D mesh with n = m × m PEs. Since all of the PEs are identical, the architecture is homogeneous. Each PE can be labeled according to its position on the mesh (as an x and y coordinate pair) and has the capability of executing several application tasks in tandem. Because all the PEs on the mesh are identical, there is no difference in the task execution time.

The task set is represented by an annotated task graph (TG), e.g. the one shown in Figure 2 of [4], which depicts a sample task graph generated by TGFF [5]. Each node in the graph represents a task of the application with its execution time given in parentheses. Communication requirements (flits) between tasks are shown on the edges (links). TGFF was used to generate several benchmark task graphs that have been used in the literature before (see [6,18]).

The cost of transferring data from one PE to another is a function of the routing algorithm. Deterministic routing algorithms can easily be incorporated into the optimization problem at hand and can help in finding the optimal application task schedule. The transfer cost based on the XY routing algorithm is proportional to the Manhattan distance, which can be calculated between points a and b on a grid as: $TC_{ab} = |x_a - x_b| + |y_a - y_b|$.
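The transfer cost can be tabulated once for the whole mesh. The sketch below builds such a distance matrix in Python; the function name and the row-major labeling of mesh positions are assumptions made for illustration only.

```python
def manhattan_distance_matrix(m: int) -> list[list[int]]:
    """Return the n x n matrix of Manhattan distances for an m x m mesh.

    Positions are labeled 0 .. m*m - 1 in row-major order (an assumption of
    this sketch), so label k sits at x = k % m, y = k // m.
    """
    coords = [(k % m, k // m) for k in range(m * m)]
    return [
        [abs(xa - xb) + abs(ya - yb) for (xb, yb) in coords]
        for (xa, ya) in coords
    ]


if __name__ == "__main__":
    d = manhattan_distance_matrix(3)
    print(d[0][8])  # opposite corners of a 3 x 3 mesh: 4
```

The resulting matrix is symmetric with a zero diagonal, and it can serve as the distance parameter of the mapping model.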

The buffer size is assumed to be unconstrained. Any communication link can only be occupied by a single packet at any given time, without any constraint on the bandwidth size. The communication links are bi-directional. In other words, any particular link between two routers can be considered as a separate link (resource) in each direction.
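To make the notion of directed link resources concrete, the following Python sketch enumerates the links that a packet uses under XY routing, moving along the x dimension first and then along y. The function name and the coordinate-pair representation of links are hypothetical choices for this illustration.

```python
Coord = tuple[int, int]
Link = tuple[Coord, Coord]  # directed link: (from_router, to_router)


def xy_route(src: Coord, dst: Coord) -> list[Link]:
    """List the directed links visited by XY (dimension order) routing."""
    (x, y), (xd, yd) = src, dst
    links: list[Link] = []
    # Travel along the x dimension until the destination column is reached.
    while x != xd:
        nx = x + (1 if xd > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    # Then travel along the y dimension.
    while y != yd:
        ny = y + (1 if yd > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    # [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((2, 0), (2, 1))]
```

The number of links on a route equals the Manhattan distance between the two routers, and each directed link can be treated as one resource in a scheduling model.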

3.2 Problem formulation

The basic mapping problem is an instance of the QAP and can be formulated as the mathematical program given in Eq. 3, where the decision variable $x_{ij}$, $i, j = 1, \ldots, n$, is a Boolean variable and f and d are the flow and distance matrices, respectively. In general, QAP instances of size n > 30 cannot be solved in a reasonable time [11].

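$$
\begin{aligned}
\underset{x \in X}{\text{minimize}} \quad & \sum_{i,j,k,l=1}^{n} f_{ik}\, d_{jl}\, x_{ij}\, x_{kl} \\
\text{s.t.} \quad & \sum_{j=1}^{n} x_{ij} = 1, \quad i = 1, \ldots, n, \\
& \sum_{i=1}^{n} x_{ij} = 1, \quad j = 1, \ldots, n, \\
& x_{ij} \in \{0, 1\}, \quad i, j = 1, \ldots, n.
\end{aligned}
\tag{3}
$$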


where the decision variable $x_{ij}$ simply determines whether goods (or information in our context) should be sent from node i ($PE_i$) to node j ($PE_j$). The parameter d is the Manhattan distance between all possible pairs of PEs on the mesh. The parameter f represents the information to be sent between each pair of PEs. Both are constants throughout the optimization.

The objective function in Eq. 3 is a representation of the dynamic power and it can be simplified by introducing a permutation variable, π, as below [11]:

$$\underset{\pi}{\text{minimize}} \quad \sum_{i,j=1}^{n} f_{ij}\, d_{\pi_i \pi_j}$$

where $\pi_i = j$ if $x_{ij} = 1$ for $i, j = 1, \ldots, n$. Introducing the permutation variable enables us to use CP by invoking the alldifferent constraint on the permutation variable π, as in Eq. 4; this constraint specifies that the values assigned to the decision variables must be pairwise distinct [8]. In other words, each of the n variables should be assigned a unique value between 1 and n.

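$$
\begin{aligned}
\underset{\pi}{\text{minimize}} \quad & \sum_{i,j=1}^{n} f_{ij}\, d_{\pi_i \pi_j} \\
\text{s.t.} \quad & \text{alldifferent}(\pi_1, \ldots, \pi_n), \\
& \pi_i \in \{1, \ldots, n\}, \quad i = 1, \ldots, n.
\end{aligned}
\tag{4}
$$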

As given in Eq. 4, the QAP can be modeled in a very intuitive way by using the CP approach. Instead of $n^2$ Boolean decision variables and 2n constraints besides the 0-1 integrality constraints in Eq. 3, there are only n decision (permutation) variables and a single alldifferent constraint besides the integrality constraints. Aside from the simplicity of the model representation, the central strength of the CP approach is its efficient propagation and search techniques, which detect dead-ends on the search tree as early as possible and prune them [8]. This approach depends on the efficient filtering of the domain, which is defined as the finite set of elements that can be assigned to a decision variable. Efficient arc consistency algorithms exist to find solutions that satisfy the alldifferent constraint. In general, alldifferent can be checked for consistency, i.e. determined to have a feasible solution, in $O(z\sqrt{n})$ time [8], where n is the number of decision variables and z is the sum of the cardinalities of the domains of the decision variables. Considering that $z = n^2$ in our problem, the complexity of consistency checking of the alldifferent constraint is then $O(n^2\sqrt{n}) = O(\sqrt{n^5})$.
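The permutation form of the objective in Eq. 4 is easy to evaluate directly. The Python sketch below does so and, purely for illustration, finds the best permutation of a tiny instance by exhaustive enumeration. This brute-force search is a stand-in chosen for this sketch and is not the CP search used in the paper; the function names, the 0-based labels, and the toy flow values are all assumptions.

```python
from itertools import permutations


def qap_cost(flow, dist, perm):
    """Objective of Eq. 4: sum over i, j of f[i][j] * d[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(
        flow[i][j] * dist[perm[i]][perm[j]] for i in range(n) for j in range(n)
    )


def solve_qap_exhaustively(flow, dist):
    """Try every permutation; each one satisfies alldifferent by construction.

    Only practical for very small n, since there are n! candidates.
    """
    n = len(flow)
    best = min(permutations(range(n)), key=lambda p: qap_cost(flow, dist, p))
    return best, qap_cost(flow, dist, best)


if __name__ == "__main__":
    # Manhattan distances on a 2 x 2 mesh, positions labeled in row-major order.
    dist = [[0, 1, 1, 2], [1, 0, 2, 1], [1, 2, 0, 1], [2, 1, 1, 0]]
    # Toy flow matrix: PE 0 sends 10 flits to PE 3, PE 1 sends 1 flit to PE 2.
    flow = [[0, 0, 0, 10], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(qap_cost(flow, dist, (0, 1, 2, 3)))      # identity placement: 22
    print(solve_qap_exhaustively(flow, dist)[1])   # best placement: 11
```

In the toy instance the identity placement puts both communicating pairs on diagonally opposite corners, while the best placement puts each pair on adjacent positions and halves the cost.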

Once the solution to the floor-planning problem determines the position of the PEs on the mesh, the application task scheduling problem can be posed as a separate mathematical programming model. Recall that all the PEs are identical and the processing times for a given task are equal for all the PEs. Hence, the optimal solution found in Stage I can be considered the best floor-planning assignment that does not consider the bandwidth constraints. In other words, it is a relaxed solution without the bandwidth constraints. The application scheduling problem in Stage II becomes challenging when the bandwidth constraints are introduced. The new scheduling problem can be posed as a constraint programming model by letting appropriate decision variables represent the start time, end time, and processing time of tasks. The advantages of using CP in scheduling are twofold [1]:

– A natural and flexible way of modeling scheduling problems with CP.

– Efficient temporal and resource constraints.

For example, precedence constraints can be represented easily by endBeforeStart, endBeforeEnd, endAtStart, and endAtEnd constraints. An endBeforeStart constraint requires a job to end before the other job starts. In other words, given tasks i and j, the endBeforeStart constraint means end(i) ≤ start(j). Similarly, an endAtStart constraint requires a job to end exactly at the start of the other job. Thus, precedence constraints can be constructed from the task graph. A noOverlap constraint can be used to schedule the tasks that use certain resources. Communication links between routers can be considered as resources to model bandwidth requirements. We can then easily use the noOverlap constraint to find a schedule of transmission tasks through a particular communication link without violating precedence and resource availability constraints.
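The meaning of these constraints can be illustrated with plain Python predicates over fixed intervals. The sketch below only checks whether a given schedule satisfies each relation; it does not search for a schedule, and it is not the OPL implementation. The `Interval` type and the snake_case function names are hypothetical and merely mirror the constraint names in the text.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A scheduled task or transmission with fixed start and end times."""
    start: int
    end: int


def end_before_start(a: Interval, b: Interval) -> bool:
    return a.end <= b.start   # a finishes no later than b begins


def end_before_end(a: Interval, b: Interval) -> bool:
    return a.end <= b.end     # a finishes no later than b finishes


def end_at_start(a: Interval, b: Interval) -> bool:
    return a.end == b.start   # b begins exactly when a finishes


def end_at_end(a: Interval, b: Interval) -> bool:
    return a.end == b.end     # a and b finish together


def no_overlap(intervals: list[Interval]) -> bool:
    """True if no two intervals on the same resource (e.g. a link) overlap."""
    ordered = sorted(intervals)  # sorts by start time, then end time
    return all(end_before_start(p, q) for p, q in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    link_usage = [Interval(0, 11), Interval(11, 22), Interval(30, 41)]
    print(no_overlap(link_usage))                        # True
    print(no_overlap(link_usage + [Interval(20, 31)]))   # False
```

Checking consecutive pairs after sorting by start time is sufficient, because an interval that ends before the next one starts also ends before all later ones start.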

4 Experimental study

This section illustrates the applicability of our constraint programming based approach on randomly generated benchmark datasets.¹ Throughout the experiments, we assume that channels can hold 64-bit flits, the packet size is 64 flits, and a 1,000 MHz architecture is used. Task execution times and packet transmission delays are measured in clock cycles, and therefore the clock operating frequency does not influence the mapping and scheduling performance. The header flits require extra time of three clock cycles. The CP models are implemented using IBM ILOG OPL Studio, which is available free of charge to academics at the IBM Academic Initiative web site.² Interested readers may consult Section IV of [4] for further details of the experimental setup.

4.1 Implementation details of the proposed model

There are practically two separate CP models, one for each stage. Stage I is primarily composed of the CP model given in Eq. 4. Since 3 × 3 and 4 × 4 regular meshes are used in our implementation, there are only nine and sixteen decision variables, respectively, for the CP model in Stage I, and the domains (D) of these variables are {1, . . . , 9} and {1, . . . , 16}, respectively. There are two main input parameters for this stage besides the random assignment of the tasks to PEs: namely the flow (f) and distance (d) matrices. The distance matrix, d, represents the transfer cost TC for XY routing, i.e. the Manhattan distances between the nodes on the meshes, which are 3 × 3 and 4 × 4 in our implementation. The flow matrix, f, expresses the total communication requirements

¹ All the datasets and optimization models can be downloaded at http://tinyurl.com/cdq5l9n.

² http://tinyurl.com/cu5txlg.


(data flits) between tasks, as given on the edges of the task graphs (see Figure 2 of [4]), with additional header flit overhead for each data packet transferred.
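One plausible way to assemble such a flow matrix is sketched below in Python. The edge-list format, the function name, the aggregation of task-to-task traffic into PE-to-PE traffic, and the choice of one header flit per packet of at most 64 body flits are assumptions of this sketch based on the platform description; the exact construction used in the experiments may differ.

```python
import math


def build_flow_matrix(edges, task_to_pe, num_pes, body_flits_per_packet=64):
    """Aggregate task-graph traffic into a PE-to-PE flow matrix.

    edges: iterable of (source_task, destination_task, data_flits).
    task_to_pe: dict mapping each task to the PE it is assigned to.
    Each packet is assumed to carry one header flit on top of its body flits.
    """
    flow = [[0] * num_pes for _ in range(num_pes)]
    for src_task, dst_task, data_flits in edges:
        src_pe, dst_pe = task_to_pe[src_task], task_to_pe[dst_task]
        if src_pe == dst_pe:
            continue  # tasks on the same PE generate no network traffic
        packets = math.ceil(data_flits / body_flits_per_packet)
        flow[src_pe][dst_pe] += data_flits + packets  # body + header flits
    return flow


if __name__ == "__main__":
    edges = [(0, 1, 100), (1, 2, 64), (0, 2, 10)]
    assignment = {0: 0, 1: 1, 2: 1}
    print(build_flow_matrix(edges, assignment, num_pes=2))  # [[0, 113], [0, 0]]
```

In the example, the 100-flit edge needs two packets (102 flits with headers) and the 10-flit edge needs one (11 flits), while the edge between tasks 1 and 2 stays inside PE 1 and contributes nothing.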

The mapping problem is poorly defined unless tasks are first randomly assigned to the PEs. Otherwise, it is always feasible to assign all of the tasks to a single PE, which results in ‘0’ energy consumption, i.e. a null solution for all the data communications of the tasks. The random assignment is devised in such a way as to guarantee the assignment of at least one task to each PE. Beyond that, each PE is equally likely to serve multiple tasks.
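A random assignment with these two properties can be sketched as follows in Python. The function name and the two-step procedure (first give every PE one task, then place the remaining tasks uniformly at random) are assumptions of this illustration; the generator used in the experiments may work differently.

```python
import random


def random_assignment(num_tasks: int, num_pes: int, seed=None) -> dict[int, int]:
    """Randomly map tasks to PEs so that every PE receives at least one task."""
    if num_tasks < num_pes:
        raise ValueError("need at least as many tasks as PEs")
    rng = random.Random(seed)
    tasks = list(range(num_tasks))
    rng.shuffle(tasks)
    # Step 1: the first num_pes shuffled tasks each claim a distinct PE.
    assignment = {task: pe for pe, task in enumerate(tasks[:num_pes])}
    # Step 2: every remaining task picks a PE uniformly at random.
    for task in tasks[num_pes:]:
        assignment[task] = rng.randrange(num_pes)
    return assignment


if __name__ == "__main__":
    mapping = random_assignment(num_tasks=20, num_pes=9, seed=1)
    print(sorted(set(mapping.values())))  # every PE label from 0 to 8 appears
```

Passing a seed makes a realization reproducible, which is convenient when the same random realizations are reused across models.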

After producing an assignment solution at Stage I by solving the CP model given in Eq. 4, we can schedule all the tasks and their corresponding communication tasks. The second stage is more complicated than the first one in terms of modeling, as a detailed implementation of the routing algorithm is required. The objective of Stage II can be set to minimizing the maximum completion time (makespan) among all the tasks. To enable an easier scheduling implementation, one can consider communication tasks among PEs at a very detailed level to generate precedence constraints appropriately. The reason for this is primarily the implementation of the wormhole switching algorithm with XY routing.

We propose the creation of a data structure at the flit-type level, given in Figure 3 of [4], to manage this implementation. The benefits and implementation of this data structure can be found in Section IV.B of [4]. By carefully crafting the data structure given in Figure 3 of [4], the precedence constraints that implement wormhole switching and XY routing can be easily devised in order to find a solution to the application task scheduling. As mentioned in Sect. 3, endBeforeStart, endBeforeEnd, endAtStart and endAtEnd constraints can accordingly be used in the Stage II CP model. Notice that the physical links between routers should be treated as resources in the application task scheduling problem. When we consider these links as resources, it is easy to implement the bandwidth constraints with noOverlap constraints. Notice that we do not have a bandwidth size limitation in our implementation; however, our formulation implicitly constrains the bandwidth size to be equal to the packet size. Therefore, only one packet can be served at any given time on a particular physical link of the network. All the constraint types used in this study, including alldifferent, can be implemented in any modern CP language without difficulty. The IBM ILOG solver, our solver of choice, can handle such constraints efficiently; we regard it as the best commercial solver available.

The run-time limits for the Stage I and Stage II CP models are both set to 1,000 s of total CPU time in our experiments. Since the IBM ILOG solver is set to use four processors, the CP models return the optimal or the best solution found within 250 s at each stage. Because the solver controls CPU time in this way, there is no need to report the actual run times.

4.2 Results

We conducted the experiments for fifty random realizations of the instances that are summarized in Table 1 of [4]. A random realization is essentially a set of random assignments of the tasks to the PEs. All of the experiments ran successfully and returned solutions without any failure. Compared to the very limited numbers of tasks and links used in previous experimental studies (e.g. around 30 in [2], at most 24 in [6]), the task graphs generated for our experiments include graphs with up to 77 tasks and 110 links. It should be noted that even the benchmark datasets used in previous work such as [6,18] were generated by TGFF [5].

We observed the objective values of the Stage I and Stage II CP models; these are summarized by the box-plots depicted in Figure 4 of [4] and Figure 5 of [4], respectively, for the 3 × 3 mesh architecture. The range of the distribution can be easily visualized by a box-plot. Outliers are presented at the tail or the head of the box-plot as circles. As the complexity of the underlying problems increases, the objective values of the CP models increase as well, since they are minimization problems.

The power models given in Sect. 2 indicate a relationship between energy consumption and the number of hops that packets traverse between PEs. Therefore, it is assumed that the actual power usage will be a function of the current objective function of the Stage I CP model, which is the sum of the information flow multiplied by the distance between PEs. In this way, our formulation addresses the issue of energy-aware design of NoC architectures. The completion time of all tasks in a task graph, i.e. the makespan, is used as the objective function in the Stage II CP model. Other objectives such as task (job) tardiness or lateness can be used as well, but the latest task completion time is a safe, generally applicable objective to minimize in scheduling problems. Notice that none of the scheduling problem cases was found to be infeasible; all of them returned the best solution.

The distributions of latency at the packet level for each instance are depicted in Figure 6 of [4] for the 3 × 3 mesh architecture. Because network congestion grows with the complexity of the CTGs, higher latencies are expected for the packet transmissions. We can conclude from Figure 6 of [4] that the variation of the packet latency also increases as the complexity of the CTG, and with it the congestion of the network, increases.

Figures 7, 8, 9 of [4] summarize the results from the 4 × 4 mesh architecture. When we compare Figures 5 and 8 of [4], we can see the effect of the extra resources (i.e. PEs) on the clock time. As expected, the 4 × 4 mesh architecture completes the tasks earlier than the 3 × 3 mesh architecture. On the other hand, the application scheduling task becomes more complex as the mesh size increases, due to the extra decision variables and resources in the model, as seen in Table 1 of [4]. However, the CP models were able to return the results within the same run-time limits as the experiments for the 3 × 3 mesh architecture.

5 Application on real benchmark datasets

In addition to the task datasets that were synthetically generated by the random problem generator TGFF, we have employed real application benchmark datasets to evaluate the mapping and scheduling algorithms. The multi-constraint system-level (MCSL) benchmark suite [10] provides a set of real applications, where each application comprises multiple tasks and the traffic data patterns between these tasks. The MCSL benchmark records the data traffic for different mesh network sizes and measures the execution time for each task in the application.

Table 1 MCSL benchmark suite applications

Application           Number of tasks   Number of comm. links
R-S code encoder      248               328
R-S code decoder      278               390
ROBOT                 88                131
SPEC95 FPPPP          334               1,145
SPARSE                96                67
H.264 video decoder   2,311             3,461

Table 2 MCSL 4×4 mesh results (Average ± Std. dev.)

              Stage I                          Stage II
Application   Objective         Num. of sol.   Objective            Num. of sol.   Run time (s)   Latency
R-S32ENC      1,324±0           7±0            1,821±18.73          1.6±0.5        0.71±0.26      5±0.01
R-S32DEC      2,208±0           9±0            2,961±151            3.8±1.8        1,600±0        12.37±0.45
ROBOT         10,169±120        19±5           92,818±1,914         1±0            1.59±0.35      7.91±0.07
FPPPP         161,021±914       15.7±3.68      85,371±5,914         1.6±0.7        385±443        149.4±6.8
SPARSE        22,206±365        19.5±5         19,982±1,101         1.2±0.4        4.3±2.24       14.49±3.1
H.264         1,501,505±3,007   19±2.36        19,532,265±727,341   1±0            23,335±597     632±90

Table 1 shows the applications provided by MCSL that were used as the datasets for our mapping and scheduling algorithms. Table 1 also shows the number of tasks of each application as well as the number of communication edges.

The files obtained from the MCSL benchmark include the task execution times, the details of the communication links between tasks, and the amount of data on each communication link. The execution time is represented in clock cycles, while the data is represented by the number of words on each link. Words are 32 bits wide, which corresponds to one flit. Each packet contains one header flit and eight data (body) flits. Between two network nodes, header flits require three clock cycles while data flits require only one clock cycle.

For each application in the MCSL benchmark, we generated 10 different random realizations of the execution times and the traffic patterns according to the distributional parameters provided in the benchmark data, specifically the files with the ‘STP’ extension [10]. Two different sizes of the mesh architecture were used in our experiments: 4×4 and 8×8. The packet size was set to eight for all the applications except H.264, for which it was set to 64 because of the computational complexity caused by the small packet size in Stage II. The CP models for both Stage I and Stage II used for the experiments in Sect. 4 were used without any change.³

³ All the related model and data files can be accessed at http://tinyurl.com/cjseuuz.

Table 2 presents the results from the experiments on the 4×4 mesh architecture. All the results were averaged over ten different random realizations, and standard deviations are reported after the ± operator. The total CPU time limit was set to 1,000 s in all the experiments for the application mapping model (i.e. Stage I) except H.264, for which it was set to 3,600 s. We set the time limit to 1,600 s in Stage II (i.e. the application scheduling model). We report the objective values and the number of different feasible solutions found for both the Stage I and Stage II CP models. None of the Stage I models ran to optimality; therefore, the reported results are based on the best solutions found before the run-time limit. However, all Stage II models produced the optimal schedule except for the R-S32DEC and H.264 applications. We also report the CPU times for Stage II, which differ from the run-time limits because most scheduling problems were solved to optimality within seconds. The most challenging problem was the application scheduling part of the H.264 benchmark. To give an idea of the complexity, a representative problem might have over 67 k variables and over 79 k constraints for the H.264 benchmark data in Stage II. Table 2 also reports the latencies in clock cycles.

Table 3 MCSL 8×8 mesh results (Average ± Std. dev.)

              Stage I                           Stage II
Application   Objective         Num. of sol.    Objective        Num. of sol.   Run time (s)   Latency
R-S32ENC      1,380 ± 0         41 ± 0          1,752 ± 18.64    2.6 ± 0.5      1.9 ± 0.82     5 ± 0.03
R-S32DEC      4,020 ± 0         14 ± 0          2,959 ± 150      12.6 ± 1.7     3,200 ± 0      17.38 ± 0.82
ROBOT         9,808 ± 299       50 ± 9          92,452 ± 1,871   1 ± 0          1.43 ± 0.25    7.80 ± 0.22
FPPPP         304,282 ± 3,506   22.8 ± 6.66     85,317 ± 5,913   1.3 ± 0.7      499 ± 502      196.4 ± 9.4
SPARSE        19,125 ± 495      58.1 ± 8.8      20,012 ± 1,101   1.2 ± 0.4      7.1 ± 3.6      17.73 ± 5.7

Similarly, Table 3 presents the results from the experiments on the 8×8 mesh architecture. However, the results for H.264 were omitted, as the second stage becomes too complex and runs into memory problems because there are around 137 k variables and 178 k constraints. The CPU time limits were set to twice those of the 4×4 experiments. Since the CPU time limits were higher, the CP solver was able to find more feasible solutions in the first stage (application mapping). However, as in the 4×4 experiments, the CP solver was not able to guarantee that the Stage I results were optimal.

A representative sample of the progression of the objective value at Stage I is given in Fig. 1. The x-axis represents the number of branches generated thus far. The CP solver was able to sift through approximately 700 k branches within the CPU time limit, which is 2,000 s for the 8×8 experiments. Each objective value corresponds to a unique solution. At the end of the run, the CP solver reports the best solution found within the CPU time limit.

Finally, we compare the alternative modeling approaches for the mapping problemin this section; namely MIP and CP models. Table 4 reports average objective valuesof Stage I over 10 random realizations for both 4×4 and 8×8 mesh architectures. Ourimplementation of the MIP model was adapted from [17]. To prevent the null (i.e. thezero objective value) solution, the model used in [17] incorporates a new constraintnot to assign a specific core to the same position on the mesh as its label. Therefore,

123

A. Demiriz et al.

Fig. 1 Progression of objective value at Stage I of the SPARSE dataset on the 8 × 8 architecture (x-axis: iteration number, in thousands; y-axis: objective value, in thousands)

Table 4 Comparing avg. objective values of MIP and CP on the Stage I model

Application            MIP (4×4 mesh)   CP (4×4 mesh)   MIP (8×8 mesh)   CP (8×8 mesh)
R-S code encoder       13,48            1,324           9,840            1,380
R-S code decoder       2,604            2,208           7,002            4,020
ROBOT                  13,270           10,169          71,254           9,808
SPEC95 FPPPP           160,072          161,020         395,005          304,283
SPARSE                 26,334           22,206          95,983           19,125
H.264 video decoder    2,663,004        1,505,505       NA               NA

Fig. 2 Progression of objective value of the Stage I MIP model on the SPARSE dataset on the 4 × 4 architecture (x-axis: number of nodes, in thousands; y-axis: objective value, in thousands)

there were some differences between the objective values of the MIP and CP models, and the CP models yielded better results in almost all cases, the exception being SPEC95 FPPPP. Our MIP implementation is also provided in the experimental files available on the web page mentioned in the footnote above. Run time limits were chosen to be comparable to those of the CP experiments.
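To see why a constraint that forbids placing a core on the position matching its label can change the reported objective, consider the following minimal Python sketch. It is not the MIP model from [17] nor the CP model of this paper: the cost function (traffic volume times hop count, where the hop count under XY routing on a mesh is the Manhattan distance), the function names, and the tiny 3-core instance are all assumptions made for illustration.

```python
from itertools import permutations


def hops(tile_a, tile_b, width):
    """Manhattan distance between two tiles on a mesh `width` tiles wide."""
    return abs(tile_a % width - tile_b % width) + abs(tile_a // width - tile_b // width)


def mapping_cost(perm, traffic, width):
    """Total volume-weighted hop count; perm[core] is the tile of that core."""
    n = len(traffic)
    return sum(
        traffic[i][j] * hops(perm[i], perm[j], width)
        for i in range(n)
        for j in range(n)
    )


def best_mapping(traffic, width, forbid_own_label=False):
    """Exhaustive search; optionally forbid core c on tile c."""
    n = len(traffic)
    candidates = [
        p
        for p in permutations(range(n))
        if not forbid_own_label or all(p[c] != c for c in range(n))
    ]
    return min(candidates, key=lambda p: mapping_cost(p, traffic, width))


# Hypothetical instance: 3 cores in a row, core 1 talks to both neighbours.
traffic = [
    [0, 5, 0],
    [0, 0, 3],
    [0, 0, 0],
]
free = best_mapping(traffic, width=3)
restricted = best_mapping(traffic, width=3, forbid_own_label=True)
print(mapping_cost(free, traffic, 3), mapping_cost(restricted, traffic, 3))  # 8 11
```

Because the restricted search space is a subset of the unrestricted one, its optimum can never be lower. In this toy case the best unrestricted placement puts core 1 on the middle tile, which the label constraint rules out.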

A representative sample of the progression of the MIP solution is given in Fig. 2. There were nearly 270 k nodes generated in this particular run, but the x-axis of Fig. 2 ends around 60 k for visual clarity, as the objective value does not change beyond that point.

6 Conclusion and future work

A CP-based two-stage model is proposed to solve the mapping and the application scheduling problems listed in [13] as outstanding research problems in application modeling and optimization for NoC architectures. The major advantage of using CP is the clarity and understandability of the models. CP modeling is more flexible than MIP on many challenging optimization problems, including mapping and scheduling. We successfully tested our model on various benchmark datasets generated by TGFF. Deterministic XY routing with wormhole switching is successfully used for task and data communication scheduling.

The approach proposed in this paper has also been applied to real application benchmark datasets. Results show clear advantages of CP models when they are used for both mapping and scheduling problems. The application benchmark datasets were purposely chosen to be challenging, with several thousand tasks and communication links forming the traffic patterns.

A logical extension of the homogeneous platform is to study the case of heterogeneous PEs. The challenge of dynamic voltage frequency islands (VFI) is an open problem in terms of providing robust models [15,6]. Therefore, an extension of the CP model should be in the direction of studying VFI on heterogeneous platforms and adaptive routing. The work in this paper can be extended in the future to real applications on various processing platforms beyond the 8×8 mesh architecture.

References

1. Baptiste P, Laborie P, Pape CL, Nuijten W (2006) Handbook of constraint programming, chap 22. Constraint-based scheduling and planning. Elsevier, Amsterdam, pp 761–799

2. Benini L, Bertozzi D, Guerri A, Milano M (2005) Allocation and scheduling for mpsocs via decomposition and no-good generation. In: van Beek P (ed) Principles and practice of constraint programming - CP 2005. Lecture notes in computer science, vol 3709. Springer, Berlin, pp 107–121. doi:10.1007/11564751_11

3. De Micheli G, Seiculescu C, Murali S, Benini L, Angiolini F, Pullini A (2010) Networks on chips: from research to products. In: Sapatnekar SS (ed) DAC. ACM, pp 300–305

4. Demiriz A, Bagherzadeh N, Alhussein A (2013) Cpnoc: On using constraint programming in design of network-on-chip architecture. In: Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, pp 486–493. doi:10.1109/PDP.2013.78

5. Dick RP, Rhodes DL, Wolf W (1998) Tgff: task graphs for free. In: Borriello G, Jerraya AA, Lavagno L (eds) CODES, IEEE Computer Society, pp 97–101

6. Ghosh P, Sen A (2010) Efficient mapping and voltage islanding technique for energy minimization in noc under design constraints. In: Shin SY, Ossowski S, Schumacher M, Palakal MJ, Hung CC (eds) SAC. ACM, New York, pp 535–541


7. He O, Dong S, Jang W, Bian J, Pan DZ (2011) Unism: unified scheduling and mapping for general networks on chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1–14. doi:10.1109/TVLSI.2011.2159280

8. van Hoeve WJ, Katriel I (2006) Handbook of constraint programming, chap 6. Global constraints. Elsevier, Amsterdam, pp 169–208

9. Hu J, Marculescu R (2005) Energy- and performance-aware mapping for regular noc architectures. IEEE Trans CAD Integr Circuits Syst 24(4):551–562

10. Liu W, Xu J, Wu X, Ye Y, Wang X, Zhang W, Nikdast M, Wang Z (2011) A noc traffic suite based on real applications. In: VLSI (ISVLSI), 2011 IEEE Computer Society Annual Symposium on, pp 66–71. doi:10.1109/ISVLSI.2011.49

11. Loiola EM, de Abreu NMM, Netto POB, Hahn P, Querido TM (2007) A survey for the quadratic assignment problem. Eur J Oper Res 176(2):657–690

12. Lombardi M, Milano M (2012) Optimal methods for resource allocation and scheduling: a cross-disciplinary survey. Constraints 17(1):51–85. doi:10.1007/s10601-011-9115-6

13. Marculescu R, Ogras ÜY, Peh LS, Jerger NDE, Hoskote YV (2009) Outstanding research problems in noc design: system, microarchitecture, and circuit perspectives. IEEE Trans CAD Integr Circuits Syst 28(1):3–21

14. Michel L, Van Hentenryck P (2010) Wiley encyclopedia of operations research and management science, chap. Basic CP theory: search. Wiley, New York. doi:10.1002/9780470400531.eorms0088

15. Ogras ÜY, Marculescu R, Marculescu D, Jung EG (2009) Design and management of voltage-frequency island partitioned networks-on-chip. IEEE Trans VLSI Syst 17(3):330–341

16. Ruggiero M, Guerri A, Bertozzi D, Milano M, Benini L (2008) A fast and accurate technique for mapping parallel applications on stream-oriented mpsoc platforms with communication awareness. Int J Parallel Program 36:3–36. doi:10.1007/s10766-007-0032-7

17. Shcherbina O. Quadratic assignment test problems. http://www.mat.univie.ac.at/~oleg/TEST/Chapter13/HCP13.5.1.mod. Accessed Apr 2013

18. Srinivasan K, Chatha KS, Konjevod G (2006) Linear-programming-based techniques for synthesis of network-on-chip architectures. IEEE Trans VLSI Syst 14(4):407–420

19. Zhang H, Beltran-Royo C, Constantino M (2010) Effective formulation reductions for the quadratic assignment problem. Comput OR 37(11):2007–2016
