Multicast-Aware Mapping Algorithm for on-Chip Networks

Amirali Habibi, Mouhammad Arjomand, Hamid Sarbazi-Azad

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Abstract: Networks-on-Chip (NoCs) are known as the most scalable and reliable on-chip communication architectures for multi-core SoCs with tens to hundreds of IP cores. Properly mapping IP cores onto NoC tiles (or assigning threads to cores in chip multiprocessors) can reduce end-to-end delay and energy consumption. While almost all previous work on mapping gives higher priority to the application flows with higher required bandwidth, this paper introduces a mapping strategy that considers multicast communication flows in addition to the normal unicast flows. To this end, multicast and unicast traffic flows are first characterized in terms of new metrics, which are then used to order communication flows by their volume and priority. A heuristic approach is used to assign IP cores to NoC tiles. Simulation results for both synthetic and real applications show up to 49% (28% on average) performance improvement and up to 44% (22% on average) energy saving compared to the best known mapping algorithm, nMap.

Keywords: Network-on-Chip, Multicast, Mapping

I. INTRODUCTION

Traditional on-chip interconnect schemes, including point-to-point links and shared buses, face technological and design problems as the system size (in terms of the number of IP cores) grows. Point-to-point links suffer from poor scalability and considerable area overhead, although they provide ideal delay and power profiles. The applicability of shared buses to future many-core SoCs (with more than tens or hundreds of IP cores) is also in question [6,8]. Hence, to reduce area overhead and increase the efficiency of the on-chip interconnect, packet-switched networks-on-chip (NoCs) have attracted much attention [4,6,8]. An analytical and experimental evaluation of various on-chip interconnect schemes (including point-to-point, shared bus, and NoC) is presented in [13]; it concludes that the NoC is the most predictable one in terms of the error between the estimated delays and the actual delays obtained after layout extraction.

Different design requirements in the NoC context pose new challenges in developing design methodologies, challenges which do not exist in traditional interconnection networks. The limited on-chip resources force a designer to find a way to distribute the application load over NoC resources (including switches, buffers, virtual channels, and wires) so that the metrics of interest are optimized and the application requirements are satisfied. This is known as the topological mapping problem. Given the average communication volumes between IP cores for an application, a mapping strategy determines how to assign IP cores to NoC tiles in order to minimize point-to-point delay [3,8,17-19,21,24], provide the required bandwidth for running applications [9,10,17,18], reduce power consumption [3,9,10,21], or lower the chip temperature [2,11].

Almost all previously proposed topological mapping techniques in the NoC context [2,3,8-11,17,18,21,24] first try to identify high-bandwidth communication flows and then prioritize these heavy flows for mapping. However, many scientific applications, including PARSEC [5] and SPLASH [23], involve multicast operations for data sharing, consistency, or network control [7,12]. Besides, in many SoC applications, input data can be effectively distributed over IP cores via multicast/broadcast operations. Multicast operations usually occupy a large amount of network resources, which can lead to increased network congestion and hence higher network latency and power consumption. Nevertheless, prior topological mapping strategies give the lowest priority to such multicast messages due to their low generation rate and small size.

To mitigate performance degradation of multicast traces, we introduce a topological mapping scheme that takes the effects of multicast communication into account. More precisely, the proposed algorithm considers the bandwidth requirements and priority of communication flows (to guarantee the QoS) for both multicast and unicast operations. To achieve this, we introduce a design flow which first extracts the communication characteristics of a given application with respect to multicast and unicast traffic volumes, and then uses the extracted values to find a mapping that can efficiently handle both multicast and unicast flows.

The most relevant work is the study in [26], which presents a methodology for synthesizing custom NoC architectures considering both multicast and unicast communication patterns. To this end, the synthesis and floor-planning problem is decomposed into inter-related steps of finding proper flow partitions. These partitions are then used to determine a proper physical network topology for each group in the partition and to provide an optimized network implementation for the derived topologies.

The remainder of this paper is organized as follows. In Section II, we formulate the multicast-aware mapping problem. Section III gives a brief overview of related work on multicast operations and NoC topological mapping. In Section IV, the proposed multicast-aware mapping algorithm is introduced; it is then evaluated in Section V under both synthetic and real CMP workloads. Finally, Section VI concludes the paper.

2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing. 1066-6192/11 $26.00 © 2011 IEEE. DOI 10.1109/PDP.2011.76

II. MULTICAST-AWARE MAPPING PROBLEM

Consider a NoC defined by a directed graph in which a vertex represents a router and a directed edge from vertex i to vertex j represents a unidirectional channel connecting router i to router j. The routing algorithm R(i,j) determines the set of routers traversed by a packet that originates at the IP core connected to router i and is destined for the IP core connected to router j. Wormhole switching is the most popular switching strategy in NoCs, leading to the minimum power and end-to-end latency [23]. To guarantee deadlock freedom, a routing algorithm should not allow cyclic dependencies when forming the possible routes. Each edge in the graph is labeled with a triple (BW,V,B), where BW is the bandwidth of the associated physical link, V is the number of virtual channels sharing the physical bandwidth of that channel, and B is the buffer size of each virtual channel. The application is assumed to be described by a directed graph known as the application characteristic graph (ACG). Each vertex in the ACG consists of a task set forming a processing module of the application, which is assigned to some IP core. Based on the computation requirement of a task set and the selected IP core, each vertex i is annotated with an execution time, EX(i). Each directed edge from vertex i to vertex j in the ACG characterizes the communication requirement between these two task sets and is labeled with a communication volume.
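The model above can be written down as plain data structures. The following is a minimal Python sketch, assuming a 2D mesh topology and dimension-ordered XY routing in the role of R(i,j); the names `Link`, `mesh_links`, `xy_route`, and all numeric values are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    """Edge label (BW, V, B) of the NoC graph."""
    bw: float  # bandwidth of the physical link
    v: int     # number of virtual channels sharing the link
    b: int     # buffer size (in flits) of each virtual channel

def mesh_links(n, label):
    """Directed edges of an n x n mesh; routers are (x, y) tuples."""
    links = {}
    for x in range(n):
        for y in range(n):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < n and 0 <= ny < n:
                    links[((x, y), (nx, ny))] = label
    return links

def xy_route(src, dst):
    """Routers visited from src to dst: first along x, then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Hypothetical ACG: vertex -> execution time EX(i),
# directed edge (i, j) -> communication volume.
acg_ex = {"cpu": 5.0, "dsp": 3.0}
acg_volume = {("cpu", "dsp"): 120.0}

noc = mesh_links(4, Link(bw=1000.0, v=2, b=2))
```

Keeping the edge label as an immutable record makes it easy to attach per-link limits later, for example when checking whether a candidate mapping exceeds the available bandwidth.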

Given a network configuration, an ACG of the application, and the design constraints, the mapping strategy in this paper decides how to topologically map IP cores, considering the communication volume and priority of the traffic flows. By defining metrics that characterize the outgoing bandwidth of each IP core and the priority of each trace, a proper tradeoff between message length, message generation frequency, and priority of communication flows can be found heuristically. For instance, a heavy unicast flow may not be prioritized over a light multicast operation with high priority. Our heuristic approach tries to map IP cores onto NoC tiles such that the cores involved in high-priority traces are placed close to each other.

III. RELATED WORK

Related work is reviewed in two contexts: multicast in multicomputer interconnection networks and mapping strategies in NoCs.

A. Multicast in Multicomputers

Multicast mechanisms can be categorized into two groups: software-based and hardware-based [24]. In software-based algorithms, multiple copies of the multicast message are generated and separately routed to their final destinations (following unicast-based routing algorithms). The main shortcoming of such algorithms is the considerable startup latency due to reinjection of the multicast message at intermediate routers [7,12,15]. In [7] and [12], multicast communication is considered a dynamic operation that can be initiated by any node at any time, with multicast trees stored in a table at each node. Note that storing such trees can consume considerable static power in the table entries, which limits the applicability of this method to large NoCs.

Hardware-based multicast techniques are mainly divided into path-based and tree-based algorithms [25]. They reduce network load but require more complex routers to handle multicast messages. Tree-based schemes (an example is shown in Fig. 1.a) form a tree rooted at the source node and deliver the multicast message along its branches to the final destinations located at the leaves. With wormhole switching, tree-based strategies can lead to early network saturation [25]. However, this scheme can be implemented with store-and-forward or virtual-cut-through switching [25], or in networks with light traffic loads.

Path-based hardware multicast was introduced to overcome the early saturation problem associated with the tree-based scheme [14]. Routing in a path-based multicast mechanism is usually based on node labeling. For instance, a node (x,y) in a 2D N×N mesh is labeled as [14]

\[
l(x,y) =
\begin{cases}
y \times N + x & \text{if } y \text{ is even} \\
y \times N + N - x - 1 & \text{if } y \text{ is odd}
\end{cases}
\qquad (1)
\]

At the source node of a multicast operation, the destination addresses are sorted by their labels to form an array. During the sort procedure, destinations are divided into multiple sets based on their position relative to the source node. Then, a limited number of copies of the multicast message are routed to the first (that is, nearest) destination of each destination set following a unicast routing strategy. When a multicast message is received by a destination, that node removes its own address from the header flits and routes the message to the next node in the sorted list (if one exists). Fig. 1.b shows the dual-path multicast mechanism (a mechanism that divides destinations into two sets, one including nodes with labels larger than the source node's label and one including nodes with smaller labels) based on the XY routing algorithm. A multi-path multicast mechanism causes the minimum congestion in the network and hence provides the best performance [25]. We use a similar strategy in this paper for managing multicast data flows.

Fig. 1. Tree-based multicast algorithm based on unicast XY routing (a); dual-path algorithm based on the unicast XY routing algorithm (b)
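The labeling and the dual-path split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label is the snake-like (Hamiltonian path) numbering with the parity test on the row index y, the function names are hypothetical, and the destination list in the example is made up rather than taken from the paper.

```python
def label(x, y, n):
    """Snake-like label of node (x, y) in an n x n mesh.

    Even rows are numbered left to right, odd rows right to left,
    so consecutive labels always belong to neighboring nodes.
    """
    return y * n + x if y % 2 == 0 else y * n + n - x - 1

def dual_path_sets(src, dests, n):
    """Split destinations into the two dual-path sets.

    Each set is ordered by increasing label distance from the source,
    i.e. in the order in which a path-based message would visit them.
    """
    ls = label(src[0], src[1], n)
    high = sorted((d for d in dests if label(d[0], d[1], n) > ls),
                  key=lambda d: label(d[0], d[1], n))
    low = sorted((d for d in dests if label(d[0], d[1], n) < ls),
                 key=lambda d: -label(d[0], d[1], n))
    return high, low

# Hypothetical multicast on a 5 x 5 mesh with source (3, 3).
high, low = dual_path_sets((3, 3), [(2, 3), (2, 4), (4, 4), (2, 2), (3, 2)], 5)
```

In this example the source has label 16, so the message copy for the high set visits (2,3), (2,4), (4,4) in that order, while the copy for the low set visits (3,2) and then (2,2).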

B. Related Work in the NoC Area

The topological mapping problem has been extensively studied in recent years [2,3,8-11,17-19,21,24]. The mapping problem for NoCs was first introduced in [19], where a branch-and-bound solution for regular NoC architectures was proposed such that the overall energy consumption is minimized and the performance of the application is guaranteed through bandwidth reservation on the physical links. In [9,10], mapping strategies with the objective of minimizing communication energy, subject to bandwidth and latency constraints, were introduced. A multi-objective mapping algorithm for mesh NoCs was presented and evaluated in [3]; it uses a genetic algorithm to find an optimal Pareto-front solution with respect to performance and power consumption. Tornero et al. [20] introduced a strategy for simultaneous mapping refinement and route selection in terms of a Pareto-optimal solution which optimizes the average delay and routing robustness. Hansson et al. [8] introduced a more general approach for application mapping and route selection in the presence of both best-effort and guaranteed-service traffic. The authors of [17] reported a strategy for mapping application task graphs onto reconfigurable NoCs. Their results show that throughput is considerably improved when the NoC architecture is reconfigurable. A methodology to map multiple applications (or use-cases that can run in parallel) onto the NoC architecture, to select paths for the various flows, and to reserve resources in the NoC while satisfying the constraints of each use-case was introduced in [18]. The authors of [23] formulated the task mapping problem as a mixed quadratic integer program (MQIP), which is inherently NP-hard. To solve this problem, they used successive relaxation and genetic approaches.

Considering thermal characteristics in the design methodology is inevitable especially in deep submicron technologies where the power density and cost of cooling are new challenges. Authors in [2] and [11] used genetic algorithms to place IP cores onto 2-D and 3-D NoCs such that the communications are minimized and the ultimate design is thermally-balanced.

To the best of our knowledge, almost none of the reported mapping algorithms consider the impact of multicast operations in the design of application-specific and general-purpose NoCs.

IV. THE PROPOSED MULTICAST-AWARE MAPPING ALGORITHM

In this section, we first introduce some metrics and then use them for evaluating a candidate mapping when exploring the design space through the proposed mapping strategy with respect to multicast.

A. Multicast metrics

In a multicast communication scheme, the multicast communication set, MCS, is defined as the set of unique multicast tuples of the form m=<s,D>, where s refers to an IP core sending a multicast message to all cores in the set D={d1,d2, …, dk} for 1<k<N-1 (where N is the network size, i.e., the number of IP cores).

As any IP core can be the origin of many multicast messages with disjoint destination sets, there may be many multicast tuples with the same source s. The multicast degree, MD, of an IP core is the total number of tuples in the MCS in which that core appears as the source.

Based on the destination sets of the different multicasts, the multicast membership value, MMV, of an IP core d is the total number of tuples in the MCS whose destination set D contains d. In other words, MMV gives the number of multicast operations in which IP core d is a destination.

For a traffic scenario, the estimated outgoing bandwidth, EOB, is defined as the bandwidth required by a source IP core for its unicast and multicast communications. Multicast messages (e.g. coherency messages in CMPs) may have great impacts on the overall system performance. Hence, a designer may decide to consider a higher priority for multicast flows with respect to unicast flows. So, the EOB can be determined as the weighted sum of all communication bandwidths

\[
EOB_C = BW_C^{Unicast} + \sum_{m = \langle s, D \rangle \in MCS,\; s = C} MC_m \times BW_m \qquad (2)
\]

where BW_C^{Unicast} denotes the required bandwidth of all unicast communications for IP core C, and BW_m is the required bandwidth of multicast m with IP core C being its source. Also, MC_m is the multicast coefficient (MC), which assigns a weight to multicast m as its priority.

Fig. 2. An example including three multicast sets: <(3,3),{(2,3),(2,4),(4,4)}>, <(1,1),{(0,3),(0,4),(2,2),(2,3)}>, and <(3,3),{(2,2),(3,2)}>
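Equation (2) can be evaluated directly once the unicast bandwidths, the multicast bandwidths, and the MC weights are known. The sketch below is a minimal Python illustration; the function name `eob`, the dictionary layout, and every numeric value (bandwidths and MC weights) are assumptions made for the example and are not taken from the paper.

```python
def eob(core, unicast_bw, multicasts, mc):
    """EOB of `core`: unicast bandwidth plus the MC-weighted bandwidth
    of every multicast whose source is `core`."""
    total = unicast_bw.get(core, 0.0)
    for m_id, (src, _dests, bw) in multicasts.items():
        if src == core:
            total += mc[m_id] * bw
    return total

# Hypothetical bandwidths and weights for the three multicasts of Fig. 2.
unicast_bw = {(3, 3): 100.0, (1, 1): 80.0}
multicasts = {
    "m1": ((3, 3), {(2, 3), (2, 4), (4, 4)}, 10.0),
    "m2": ((1, 1), {(0, 3), (0, 4), (2, 2), (2, 3)}, 20.0),
    "m3": ((3, 3), {(2, 2), (3, 2)}, 5.0),
}
mc = {"m1": 2.0, "m2": 1.5, "m3": 4.0}

eob_33 = eob((3, 3), unicast_bw, multicasts, mc)  # 100 + 2.0*10 + 4.0*5
eob_11 = eob((1, 1), unicast_bw, multicasts, mc)  # 80 + 1.5*20
```

With these made-up weights, a light multicast such as m3 contributes as much to the source's EOB as the heavier m1, which is the effect the weighting is meant to allow.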

The mapping algorithm proposed in the next subsection tries to effectively map IP cores onto NoC tiles taking into account the priority of multicast traces.

For example, consider a typical application including some multicasts, as shown in Figure 2. It includes three multicasts (shown in different colors) and is mapped, in some arbitrary way, onto a 5×5 mesh NoC: multicast m1 originates at s=(3,3) and is destined for nodes D={(2,3),(2,4),(4,4)}; multicast m2 has source s=(1,1) and destination set D={(0,3),(0,4),(2,2),(2,3)}; and multicast m3 has source s=(3,3) and destination set D={(2,2),(3,2)}. As IP core (1,1) is the source of one multicast, its multicast degree, MD, is 1, while the MD value is 2 for IP core (3,3) and zero for all other nodes. In this scenario, the MMV of IP cores (2,2) and (2,3) equals 2 (since each is a destination in two multicast traces), while IP cores (2,4), (4,4), (0,3), (0,4), and (3,2) have an MMV of 1, and it is 0 for all other cores.
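The MD and MMV values of this example can be checked mechanically by counting sources and destinations over the three tuples. A minimal Python sketch follows; the variable names are illustrative.

```python
from collections import Counter

# The three multicast tuples <s, D> of the Figure 2 example.
mcs = [
    ((3, 3), {(2, 3), (2, 4), (4, 4)}),          # m1
    ((1, 1), {(0, 3), (0, 4), (2, 2), (2, 3)}),  # m2
    ((3, 3), {(2, 2), (3, 2)}),                  # m3
]

# MD: how often a core appears as a source.
md = Counter(src for src, _ in mcs)
# MMV: how often a core appears in a destination set.
mmv = Counter(d for _, dests in mcs for d in dests)
```

A `Counter` returns 0 for cores that never appear, which matches the convention in the text that MD and MMV are zero for all other nodes.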

B. Proposed algorithm

Using the metrics defined above, we now introduce our multicast-aware mapping algorithm. Figure 3 depicts the stages of the mapping approach, which consists of two phases. In the first phase, the outgoing bandwidth of each source node is modified with respect to its multicast operations. In the second phase, a heuristic algorithm maps IP cores onto NoC nodes based on the estimated outgoing bandwidth.

Based on the estimated outgoing bandwidth values (defined in the previous subsection) and the communication volumes of the application characteristic graph, the required output bandwidth of each IP core is estimated. To this end, the MC factor of each multicast communication is computed using the MD and MMV of the involved IP cores. A high MD implies that the core can potentially consume a large share of network resources. Thus, a higher multicast degree implies a larger MC for the operation and consequently a larger EOB value for the source. If a source node with a high MD sends a multicast message to destinations distributed across the network, the message may travel a long distance in the network. Consequently, the total network power and the average message latency increase. Likewise, a high MMV means that the core receives a large amount of incoming traffic. Note that such a core can be an intermediate node which receives a multicast message and relays it to other cores in the destination set. Hence, if such a core is mapped on a path along with its sources (in all MCSs) and the final destinations of each multicast flow are scattered across the network, the delay and power consumption of the network are adversely affected. If the MD of an IP core is low (say 2 or 3), it can be given less weight and even treated like a unicast flow.

Different approaches (including heuristics) can be used to compute the MC values. Here, we use a fuzzy scheme as it can well describe the concept of degree and priority. Moreover, with the possibility of adding human knowledge, one can enrich the knowledge database of the fuzzy inference engine. We use MATLAB Fuzzy Toolbox and our inferring engine is Mamdani’s inference engine [16]. Details on how to use fuzzy theory to implement our algorithm are beyond the scope of this paper and can be found in [16]. Note that any alternative heuristic or approximation can be used for computing MC values.
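Since any alternative heuristic can be used for the MC values, a very simple stand-in is sketched below in Python. It is not the fuzzy inference system used in the paper: the function name, the threshold `md_low`, and the weights `alpha` and `beta` are all hypothetical. It only reproduces the qualitative behavior described above, namely that MC grows with the MD of the source and the MMV of the destinations, and that low-degree sources are weighted like unicast flows.

```python
def multicast_coefficient(md_src, mmv_dests, md_low=3, alpha=0.5, beta=0.25):
    """Illustrative MC heuristic (an assumption, not the paper's rule base).

    md_src:    multicast degree of the source core
    mmv_dests: MMV values of the destination cores (non-empty list)
    """
    if md_src <= md_low:
        # Low-degree sources are treated like unicast flows (weight 1).
        return 1.0
    avg_mmv = sum(mmv_dests) / len(mmv_dests)
    return 1.0 + alpha * (md_src - md_low) + beta * avg_mmv

mc_light = multicast_coefficient(2, [1, 1])
mc_heavy = multicast_coefficient(5, [2, 2])
```

A crisp formula like this is easier to inspect than a fuzzy rule base, but it loses the ability to encode designer knowledge as linguistic rules, which is the reason the text gives for preferring the fuzzy scheme.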

We follow a heuristic approach for mapping IP cores onto NoC tiles such that end-to-end delay and network power are minimized, while the application bandwidth requirement is guaranteed through resource reservation. Figure 4 presents our multicast-aware mapping strategy through two procedures: Map and Place. Procedure Map starts with mapping one core and then heuristically tries to map other cores. In each iteration, a core among the non-mapped cores, G, is selected for mapping on NoC tiles (using function Find(G) which investigates all non-mapped cores and chooses the one with the largest EOB value). Then, the procedure tries mapping the selected core with highest priority by exploring a mapping tree to find multiple candidate mappings for an IP core.

Fig. 3. The proposed multicast-aware mapping strategy


When the candidate tiles for an IP core are determined, procedure Place is initiated to investigate the applicability of each mapping and to choose among the proper mappings with the minimum power consumption. Mapping an IP core on an NoC tile is desired if the increased load on network resources does not exceed the maximum available bandwidth. For instance if assigning a new IP core results in a required bandwidth for a physical link which is beyond its maximum bandwidth, this mapping cannot be chosen as it cannot guarantee the application bandwidth requirement.
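The bandwidth test described above can be sketched as follows. This is a minimal Python illustration, assuming XY routing, a single bandwidth limit shared by all links, and unicast flows only; the names `xy_links` and `is_feasible`, the core names, and the numbers are hypothetical.

```python
from collections import defaultdict

def xy_links(src, dst):
    """Directed links crossed by XY routing from tile src to tile dst."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def is_feasible(placement, flows, link_bw):
    """True if no link is loaded beyond link_bw.

    placement: core -> tile (x, y); flows: (src, dst) -> required bandwidth.
    Flows with an endpoint that is not placed yet are ignored.
    """
    load = defaultdict(float)
    for (s, d), bw in flows.items():
        if s in placement and d in placement:
            for link in xy_links(placement[s], placement[d]):
                load[link] += bw
    return all(v <= link_bw for v in load.values())

flows = {("a", "b"): 60.0, ("c", "b"): 50.0}
crowded = {"a": (0, 0), "b": (2, 0), "c": (1, 0)}  # both flows share one link
spread = {"a": (0, 0), "b": (2, 0), "c": (2, 1)}   # flows use disjoint links
```

With a link bandwidth of 100, the `crowded` placement is rejected because the link from (1,0) to (2,0) would carry 110, while the `spread` placement is accepted.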

In the current mapping strategy, forming an ordered array for each multicast source and destination set is very important, especially in path-based multicast schemes. As previously stated, in a path-based multicast scheme, a multicast message is routed through the network nodes based on their position relative to the source node. If the source and the destination set (based on their relative positions) form a proper route pattern, the traffic is minimized and hence the performance is maximized. In this case, the amount of resources occupied by a multicast flow is less than when the nodes of the multicast flow are mapped without a proper arrangement. As an example of multicast destination diversity, consider Figure 5. In Figure 5.a, if the communication between the source node and its three destinations A, B, and C (as a multicast set) requires a high bandwidth, it would be wise to rearrange the destination nodes as shown in Figure 5.b, where the destination set is arranged in a pattern that suits the path-based routing algorithm. All destination cores can then be covered with a single short path supported by a minimal unicast routing algorithm. As a multicast operation with high priority can block multiple unicast flows, applying such arrangements can efficiently reduce the total multicast (as well as unicast) latency and power consumption.

Note that procedure Map may return multiple mappings which are then simulated (with real traces of the running application) to select the most efficient one.
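The overall select-then-place loop can be illustrated with a much simplified greedy sketch in Python. It is not the authors' Map and Place procedures: it keeps only the idea of picking the unmapped core with the largest EOB first, and it replaces the mapping-tree exploration, the bandwidth check, and the multicast path arrangement with a single assumed cost, the sum of volume times Manhattan distance to already placed partners. All names and numbers are hypothetical.

```python
def greedy_map(eob, volume, tiles):
    """Place cores in decreasing EOB order on the cheapest free tile.

    eob:    core -> estimated outgoing bandwidth
    volume: (core_a, core_b) -> communication volume
    tiles:  list of free tiles (x, y); ties go to the earliest tile
    """
    placement, free = {}, list(tiles)
    for core in sorted(eob, key=eob.get, reverse=True):
        def cost(tile):
            c = 0.0
            for (a, b), vol in volume.items():
                other = b if a == core else a if b == core else None
                if other in placement:
                    ox, oy = placement[other]
                    c += vol * (abs(tile[0] - ox) + abs(tile[1] - oy))
            return c
        best = min(free, key=cost)
        placement[core] = best
        free.remove(best)
    return placement

tiles = [(x, y) for y in range(2) for x in range(2)]
eob_values = {"a": 30.0, "b": 20.0, "c": 10.0}
volume = {("a", "b"): 20.0, ("a", "c"): 10.0, ("b", "c"): 1.0}
placement = greedy_map(eob_values, volume, tiles)
```

Because this sketch commits to one tile per core, it returns a single mapping, whereas the procedure in the text may return several candidate mappings that are then compared by simulation.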

V. SIMULATION RESULTS

In this section, we evaluate the proposed multicast-aware mapping strategy and compare it with the well-known nMap (a mapping scheme that gives multicast communication traces no priority over unicast ones). The final mappings found by the proposed algorithm and by nMap are evaluated on NoC architectures simulated by Xmulator [1], a fully parameterized discrete-event simulator for interconnection networks, augmented with the Orion library [22] to calculate the power consumption of the interconnect components.

Algorithm Map (ACG, NoC, H, R)
// G(V,E) is the application characteristic graph, with the required bandwidth
// between IP cores i and j denoted by w_{i,j}, as the main communication graph.
// NoC(BW,V,B) is the NoC platform. Each core of ACG will be assigned to a tile
// of the NoC. No core of ACG is mapped to NoC at the beginning.
// H ⊆ ACG is the set of cores which are already mapped to NoC.
// R is a set that will include all the mappings found. It is empty at the beginning.
{
    if ( ) then
        ( ); ( ); ;
    end if
    if ( ) then
        R ← R + NoC;
    else
        while ( ) do
            ( );
            if then
                ( ); ( , , , );
            end if
            ( );
            ; ;
        end while
        R ← R + NoC;
    end if
}

Algorithm Place ( , , , )
// This algorithm places node on a P node.
{
    Form set T, including all nodes already mapped which are in the destination set of all multicasts with node as the source;
    Form set Z, including the source nodes of multicasts that node is a member of their destination set;
    ;
    for all do
        if then ;
    for all do
        if then ;
    Place at a NoC node with the minimum cost with respect to Z and ;
    // There are some nodes of multicast tuples each with a source already
    // placed. The placed nodes of each multicast tuple make some array (path) in
    // NoC. We must place node within the array that has minimum length
    // (including node).
}

Fig. 4. Pseudo-code for the multicast-aware mapping algorithm.


A. Synthetic workloads

To evaluate the proposed mapping algorithm thoroughly, we start by investigating the method under synthetic traffic loads with different multicast set sizes. In these experiments, each IP core generates messages following a Poisson distribution. The multicast and unicast destination addresses are chosen uniformly at random from the network nodes. We examined a random number of destinations for each multicast, drawn from a given range: 4 to 8 IP cores (referred to as set R1), 4 to 10 IP cores (set R2), 6 to 10 IP cores (set R3), and 6 to 12 IP cores (set R4). For each set, we vary the number of different multicasts (here 16, 32, and 64; see footnote 1). Simulation experiments are performed for a 128-bit wide 4×4 mesh NoC with wormhole flow control. The XY routing algorithm is used for unicast messages, while multicast routing is the multipath multicast [25]. Unicast messages are 10 flits long, and the multicast message size is either 2 flits (for multicasts with a small destination set of K/2 nodes, where K is the maximum possible destination set size in the corresponding experiment) or 3 flits (for multicasts with larger destination sets). Each flit is transmitted over a physical link in one cycle. Moreover, the process feature size and working frequency of the NoC are set to 70nm and 2.5GHz, respectively. Each physical channel is equipped with 2 virtual channels, and each input virtual channel has a 2-flit buffer. The link length is assumed to be 1 millimeter.
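The way the synthetic multicast sets are drawn can be sketched in Python as follows. This is an illustrative generator under stated assumptions, not the simulator's code: the function name and the fixed seed are hypothetical, and the sketch does not enforce that the generated tuples are unique.

```python
import random

def random_multicasts(nodes, count, k_min, k_max, rng):
    """Draw `count` multicast tuples <s, D>.

    Each tuple gets a uniformly chosen source and between k_min and k_max
    (inclusive) destinations, chosen uniformly from all nodes except the
    source.
    """
    tuples = []
    for _ in range(count):
        src = rng.choice(nodes)
        k = rng.randint(k_min, k_max)
        candidates = [n for n in nodes if n != src]
        tuples.append((src, set(rng.sample(candidates, k))))
    return tuples

# Range R1 (4 to 8 destinations) with 16 multicasts on a 4 x 4 mesh.
nodes = [(x, y) for x in range(4) for y in range(4)]
r1 = random_multicasts(nodes, 16, 4, 8, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the workload reproducible, so that two mapping algorithms can be compared on exactly the same multicast sets.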

Figures 6.a and 6.b depict the average message latency and total network power for various synthetic workloads, respectively. In these scenarios, only 2% of the generated messages are multicast (since multicast messages usually contain control data and hence are limited in size and generation rate), while the remaining 98% are unicast. For a given destination set size range (R1 to R4), the benefit of the proposed algorithm grows as more multicast messages are generated.

Footnote 1: For instance, consider an IP core i that is randomly selected as a multicast source node. Then, for range R1, a random number between 4 and 8 is generated to indicate the number of destinations for that multicast trace (suppose this random number is 5). Hence, a multicast message with source i and 5 randomly selected IP cores as destinations is injected into the network. Note that the destinations can be selected from all network nodes except the source IP core i.

Note that when the destination set size range (R1 to R4) is increased, the proposed method performs better. This is because the proposed mapping algorithm explicitly accounts for multicasts, so scenarios with more multicast operations benefit more. As can be seen in the figures, the proposed multicast-aware mapping algorithm can yield up to 49% (27% on average) improvement in performance and up to 44% (25% on average) reduction in energy when compared to the nMap mapping scheme. Note that this improvement is achieved when only 2% of the generated messages are multicast. Much larger improvements are expected if multicasts form a larger portion of the generated traffic load.

B. Real CMP applications

Topological mapping of IP cores on NoC tiles is the main concern in designing application-specific SoCs (AS-SoCs). However, in CMPs consisting of a homogeneous set of processors, each with a private L1 cache and a shared L2 cache, the important issue is how to assign threads to processors to achieve the best performance and the least power consumption.

Fig. 6. Average message latency (in cycles) and total network energy (in µJ) for various synthetic workloads (R1 to R4 with 16, 32, or 64 different multicasts) on a 4×4 Mesh NoC.

Fig. 5. Rearrangement in multicast destinations based on path-based routing algorithm: (a) a multicast with destination set diversity, (b) rearrangement of destinations.

[Fig. 6 plot panels: (a) Average Message Latency (cycles); (b) Total Network Energy (µJ). Legend: nMap, Prop. Alg.]

We experiment on the SPLASH-2 multi-threaded benchmarks [23] as CMP workloads. The multicast traffic appearing in the SPLASH-2 benchmarks is due to the cache coherency protocol [23]. Simulation experiments are performed for a 32-bit wide CMP with a 4×4 mesh NoC. The XY wormhole routing algorithm is used for unicast messages, while multicast routing is based on a path-based algorithm. Unicast messages are 5 flits long and multicast messages are 2 flits long. It takes one cycle to transmit a flit over a physical link. Moreover, the process feature size and working frequency of the system are set to 70nm and 2.5GHz, respectively. Each physical channel is equipped with 2 virtual channels and each input virtual channel has a 2-flit buffer. The link length is considered to be 1 millimeter.
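To make the two simulation setups easier to compare, the following sketch derives a few quantities from the stated parameters: raw link bandwidth, message size in bits, and the time to push a message onto a link. It assumes that one flit is exactly as wide as the link (128 bits in the synthetic setup, 32 bits in the CMP setup), which the text implies but does not state. The `NocConfig` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NocConfig:
    name: str
    link_width_bits: int   # assumed equal to the flit width
    frequency_hz: float

    def link_bandwidth_bps(self) -> float:
        """Raw link bandwidth when one flit crosses the link per cycle."""
        return self.link_width_bits * self.frequency_hz

    def message_bits(self, flits: int) -> int:
        """Payload carried by a message of the given length."""
        return flits * self.link_width_bits

    def serialization_ns(self, flits: int) -> float:
        """Time to inject all flits of a message onto one link."""
        return flits / self.frequency_hz * 1e9


SYNTHETIC = NocConfig("synthetic", link_width_bits=128, frequency_hz=2.5e9)
SPLASH2 = NocConfig("SPLASH-2 CMP", link_width_bits=32, frequency_hz=2.5e9)

if __name__ == "__main__":
    # Unicast lengths from the text: 10 flits (synthetic), 5 flits (CMP).
    for cfg, unicast_flits in ((SYNTHETIC, 10), (SPLASH2, 5)):
        print(f"{cfg.name}: "
              f"{cfg.link_bandwidth_bps() / 1e9:.0f} Gbit/s per link, "
              f"unicast = {cfg.message_bits(unicast_flits)} bits, "
              f"{cfg.serialization_ns(unicast_flits):.1f} ns to serialize")
```

These figures are raw link limits only. They ignore router pipeline delay, contention, and virtual-channel allocation, all of which the simulator accounts for.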

Figures 7.a and 7.b depict the average message latency and total network energy consumed for the SPLASH-2 benchmarks, respectively. Here, the difference between the proposed mapping algorithm and nMap is mainly a function of the multicast traffic rate. For instance, the FFT program has a low multicast rate with a small average destination set size, so the performance and power improvement over nMap is not considerable when using the proposed multicast-aware mapping algorithm. However, for programs such as Ray-Trace, with a higher multicast rate and a larger average destination set size, the improvement (in both performance and power) is considerable.

Simulation results show up to 33% (29% average) performance improvement and up to 27% (19% average) energy saving when the multicast-aware mapping strategy is applied.

VI. CONCLUSION

In this paper, we first introduced some metrics characterizing the required bandwidth of IP cores with regard to their unicast and multicast communications. These estimated bandwidth values and the NoC configuration were used as inputs to a heuristic mapping strategy that tries to place frequently communicating IP cores adjacent to each other, giving priority to multicast flows. Simulation results for both synthetic and real CMP applications confirmed the efficiency of the multicast-aware mapping strategy, which achieved up to 49% (28% on average) performance improvement and up to 44% (22% on average) energy saving when compared to the well-known nMap algorithm.

REFERENCES

[1] Xmulator NoC simulator, 2008, http://www.xmulator.com/.

[2] C. Addo-Quaye, "Thermal-aware mapping and placement for 3-D NoC designs," in Proc. SoCC, 2005, pp. 25-28.

[3] G. Ascia, V. Catania, and M. Palesi, "Multi-objective mapping for mesh-based NoC architectures," in Proc. CODES+ISSS, 2004, pp. 182-187.

[4] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," IEEE Computer, vol. 35, pp. 70-78, Jan. 2002.

[5] C. Bienia, S. Kumar, J.P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proc. PACT, 2008, pp. 72-81.

[6] W.J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proc. DAC, 2001, pp. 684-689.

[7] N.D. Enright, L.S. Peh, and M.H. Lipasti, "Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence," in Proc. MICRO, 2008, pp. 35-46.

[8] A. Hansson, K. Goossens, and A. Radulescu, "A unified approach to mapping and routing on a network-on-chip for both best-effort and guaranteed service traffic," in Proc. VLSI Design, 2007.

[9] J. Hu and R. Marculescu, "Energy- and performance-aware mapping for regular NoC architectures," IEEE TCAD, vol. 24, no. 4, 2005, pp. 551-562.

[10] J. Hu and R. Marculescu, "Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures," in Proc. DATE, 2003, pp. 688-693.

[11] W. Hung et al., "Thermal-aware IP virtualization and placement for networks-on-chip architecture," in Proc. ICCD, 2004, pp. 430-437.

[12] N.E. Jerger, L.S. Peh, and M. Lipasti, "Virtual circuit tree multicasting: A case for on-chip hardware multicast support," in Proc. ISCA, 2008, pp. 229-240.

[13] H.G. Lee et al., "On-chip communication architecture exploration: a quantitative evaluation of point-to-point, bus, and Network-on-Chip approaches," ACM Trans. on Design Automation of Electronic Systems, vol. 12, no. 3, 2007.

[14] X. Lin and L.M. Ni, "Deadlock-free multicast wormhole routing in multicomputer networks," in Proc. ISCA, 1991, pp. 116-125.

[15] Z. Lu, B. Yin, and A. Jantsch, "Connection-oriented multicasting in wormhole-switched networks on chip," in Proc. ISVLSI, 2006, pp. 205-211.

[16] E.H. Mamdani and S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller," J. Human Computer Studies, vol. 51, no. 2, pp. 135-147, 1999.

[17] M. Modarressi and H. Sarbazi-Azad, "Power-aware mapping for reconfigurable NoC architectures," in Proc. ICCD, 2007.

[18] S. Murali et al., "A methodology for mapping multiple use-cases onto networks on chips," in Proc. DATE, 2006, pp. 118-123.

[19] S. Murali and G. De Micheli, "Bandwidth-constrained mapping of cores onto NoC architectures," in Proc. DATE, vol. 2, 2004, pp. 896-901.

Fig. 7. Average message latency (in cycles) and total network energy (in µJ) when assigning threads of SPLASH-2 programs on a CMP with 7×7 Mesh topology.

[Fig. 7 plot panels: (a) Average Message Latency (cycles); (b) Total Network Energy (µJ). Legend: nMap, Prop. Alg.]

[20] J.D. Owens et al., "Research challenges for on-chip interconnection networks," IEEE Micro, vol. 27, pp. 96-108, 2007.

[21] R. Tornero, V. Sterrantino, M. Palesi, and J.M. Orduna, "Multi-objective strategy for concurrent mapping and routing in networks on chip," in Proc. IPDPS, 2009, pp. 1-8.

[22] H.S. Wang, X. Zhu, L.S. Peh, and S. Malik, "Orion: a power-performance simulator for interconnection networks," in Proc. MICRO, 2002, pp. 294-305.

[23] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. ISCA, 1995, pp. 24-36.

[24] J. Wooyoung and D.Z. Pan, "A3MAP: Architecture-aware analytic mapping for networks-on-chip," in Proc. ASP-DAC, 2010, pp. 523-528.

[25] C. Yalamanchili, L. Ni, and J. Duato, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, 2003.

[26] S. Yan and B. Lin, "Custom networks-on-chip architectures with multicast routing," IEEE TVLSI, vol. 17, no. 3, pp. 342-355, 2009.
