
XTRANC: A Routing Algorithm for Dynamically Reconfigurable Networks-on-Chip

Arash Farhadi Beldachi *1, Simon Hollis **2, Jose L. Nunez-Yanez *3

* Department of Electronic Engineering University of Bristol, UK

1 [email protected] [email protected]

** Department of Computer Science University of Bristol, UK

2 [email protected]

Abstract—This paper presents a novel routing algorithm called XTRANC which supports topologies based on a variable number and size of inner-torus building blocks. The inner-tori partition a traditional mesh network into an arbitrary number of sub-networks to increase the mesh performance. The sub-networks can generate non regular global topologies which are also supported by the XTRANC algorithm. XTRANC is especially suitable for dynamically reconfigurable networks mapped to commercial FPGAs in which additional links are added to the mesh topology at run-time to reduce congestion depending on application behaviour and resource availability. XTRANC allows the insertion of links as requested by different parts of the application without centralized control and this research shows that despite this dynamic behaviour the routing algorithm remains deadlock free.

1. Introduction

NoC (Network-on-Chip) systems can be constructed with regular (i.e. uniform) or irregular (i.e. custom) topologies. Regular topologies such as the traditional mesh are more popular due to, for example, their modularity and predictability, which help their design, verification and analysis. However, mapping irregular task graphs to these topologies can result in local congestion and lower link utilization. Custom topologies avoid this by being designed for specific applications; they usually change the numbers of routers and links and offer better system performance. On the other hand, they lack general-purpose computation capabilities and are complex to design [1].

Among regular networks, the mesh (e.g. 2D/3D) is the most popular way of achieving a scalable communication model in NoC systems [2]. The torus topology is a good alternative to increase the throughput and reduce the average latency but, when the network is large, the achievable clock frequency can degrade due to the long links needed between the border nodes. Partitioning the mesh network into smaller networks that form inner-tori could be a good solution to increase the mesh network performance and bridge the performance gap between regular and customized networks. The creation of the inner-tori means that the internal routers must be able to modify their architecture to support additional communication ports. Also, the additional links affect the regularity of the original topology and demand a special routing algorithm. In this scenario, the topology is not completely irregular, but has a regular underlying topology with extra links.

In this work, we propose a new routing algorithm called eXtended Torus Routing Algorithm for NoC (XTRANC), which is based on the Torus Routing Algorithm for NoC (TRANC) [3]. We have selected TRANC because it uses only one virtual channel per physical channel and it is a deadlock-free routing algorithm for the torus NoC. TRANC and XTRANC take advantage of the Interconnection Routing Notation (IRN) [3], which is a map-based systematic approach to designing routing algorithms for mesh and torus NoCs and can be extended to other topologies as well.

XTRANC is suitable for both homogeneous and irregular networks. It helps mesh networks to achieve better performance by partitioning them into sub-networks, and it is helpful when extra links need to be added to a mesh network to reduce the latency on specific paths according to the current traffic pattern. This routing algorithm is designed to work in NoCs implemented in FPGAs (Field Programmable Gate Arrays) that support dynamic reconfiguration. Dynamic reconfiguration allows the network logic to be changed at run-time and is available in state-of-the-art FPGAs from major manufacturers. The contributions of this paper can be summarised as follows:


1) We propose and develop the deadlock-free XTRANC algorithm for dynamically reconfigurable networks using the Interconnection Routing Notation.

2) We use a physical prototype based on the SoCWire NoC to evaluate the complexity, latency and throughput of the XTRANC algorithm.

3) We evaluate the performance of XTRANC in homogeneous and irregular networks and compare it with the popular 2D mesh topology.

The rest of the paper is organized as follows: Section 2 reviews related works. Section 3 briefly introduces the Interconnection Routing Notation used to develop XTRANC. XTRANC is introduced in Section 4, focusing on its deadlock-free behaviour. Section 5 introduces our performance evaluation platform based on FPGA devices. Section 6 investigates XTRANC deployed in a homogeneous network resulting from partitioning a mesh network into an arbitrary number of inner-tori. Section 7 evaluates XTRANC after adding extra-long range links to the mesh network, creating irregular topologies. Finally, Section 8 concludes the paper.

2. Related Works

Several routing algorithms based on the traditional ‘X-Y’ algorithm have been proposed, and here we highlight key work. A comprehensive review is available in [4], which covers routing algorithms for NoC architectures at all abstraction levels, from the algorithmic level to actual implementation. Odd-even turn [5] is a routing algorithm that avoids deadlock by restricting the locations where some turns can take place. DyAd [6] is a combination of oe-fix, which is a deterministic routing algorithm, and odd-even. In this scheme, routers can switch between the two routing algorithms according to the congestion situation of the network. Intermittent X, Y (IX/Y) [7] was introduced to achieve fewer collisions. This algorithm uses the ‘X-Y’ and ‘Y-X’ algorithms intermittently to choose the next hop and improves the average delay. A variable that switches between ‘0’ and ‘1’ provides the selection mechanism: it is set to ‘0’ for the first packet to select the ‘X-Y’ routing algorithm and to ‘1’ for the second packet to select the ‘Y-X’ routing algorithm, and it continues to alternate for the following packets. The authors used a simulator to evaluate a 4x4 mesh network.

Dynamic XY (DyXY) [8] is an adaptive routing algorithm based on the congestion conditions in the proximity of the router. In this algorithm, packets are only routed along a shortest path between source and destination. When there are multiple shortest paths, the path is selected according to the network congestion conditions. A simulator was employed to measure the performance.

XYX [9] is a fault-tolerant routing algorithm. It employs the ‘X-Y’ and ‘Y-X’ algorithms and is designed for higher traffic loads. The source node adds an error detection code to the header of the packet and copies the packet before sending it. It then keeps the copy as a redundant packet and sends the packet to the destination. The destination node checks the packet, calculates the error detection code and compares it with the one in the packet header. The packet is accepted when there is no error. If there is an error, the destination node sends a NACK message to the source node to request that the packet be resent using the redundant packet. The source node sends the original packet to the destination using the ‘X-Y’ routing algorithm and sends the redundant packets using the ‘Y-X’ routing algorithm. Simulation was used for the experimental analysis in this work.

[10] proposed an adaptive routing algorithm called Adaptive XY, which operates as deterministic ‘X-Y’ or adaptive routing depending on the congestion conditions of the current router's neighbours. In this scheme, a switch that acts as a context-aware agent routes the packets according to the network congestion status, and a packet is routed through a less congested path in the high congestion scenario. This work used simulation to evaluate the algorithm.

In contrast to these works, XTRANC is specially designed to work in NoCs implemented in FPGAs and supports both homogeneous and irregular networks. It helps mesh networks to achieve better performance by partitioning them into inner-tori, and it is helpful when extra-long range links need to be added to a mesh network through dynamic reconfiguration to reduce the latency on specific paths according to the current traffic pattern.

There are other efforts investigating alternative regular topologies to the mesh and adding extra-long range links to a mesh network to increase network performance. We highlight relevant examples in this section. A new NoC topology, called a partially interconnected mesh (full-mesh) network, and a routing algorithm to support the architecture are presented in [11, 12]. This architecture extends the mesh network by adding four bidirectional channels to remove congestion and hot spots compared to the mesh network, but it introduces a significant overhead. Diametrical 2D mesh [13] adds eight extra links to the mesh topology to reduce the mesh diameter when it is expanded to a large number of IP cores, and tries to minimize the area and power consumption. The number of extra links is always fixed at eight and does not grow as the number of IP cores increases. Adding eight extra links to a mesh network partitions it into four overlapping sub-networks, and these extra links connect the edges of the sub-networks to each other. The main issue is that the sub-network size grows with the number of IP cores and makes the extra links longer, which causes difficulty in the later stages of the implementation, such as place and route. In addition, the traffic load is not well balanced and distributed in this method.

[14] presented mesh networks with random additional links. This work proved that adding random extra links can increase the critical load of a network, the load at which performance starts to degrade. It is a theoretical work and calls for a practical analysis and implementation.

[15] introduces a technique to synthesize an optimal irregular mesh topology for a specific application and demonstrates that an irregular mesh can increase the performance compared to a traditional mesh topology. In this method, a computationally expensive analysis of the frequency and magnitude of communication between IP cores in the application is performed to identify the optimum locations to add Long Range Links (LRLs) to the traditional mesh topology. The frequency of communication between each pair of nodes during the execution of the application on a traditional mesh must therefore be identified, so this technique is suitable for applications with predictable communication behaviour. The method does not need any global information for routing but needs local information (at least 1 neighbour) about the network to compute the distance. The shorter route calculation requires time, which could increase the router latency. The network resulting from this technique is application specific and is not suitable for general-purpose computing.

[16] introduced the Skip-link architecture, which dynamically reconfigures NoC topologies to reduce the overall switching activity in many-core systems. This architecture allows the creation of long-range skip links at runtime to reduce the logical distance between regularly communicating nodes. However, the logical distance between some other nodes is increased, which causes hop penalties, so Skip-links are only allowed to be placed when, summed over all of the traffic flows, the hop-count savings outweigh the penalties [16]. It is also not targeted at reconfigurable chips.

To the best of our knowledge, this is the first work to propose a routing algorithm for both modified regular and irregular meshes with extra-long range links that is specially designed for dynamically reconfigurable FPGAs. Our proposed routing algorithm does not need any lookup tables or global or local information when the extra links are added, and it is designed to work on reconfigurable chips that can alter the network infrastructure at run-time.

3. Background

Interconnection Routing Notation (IRN) [3] is a map-based systematic approach to designing routing algorithms for mesh and torus based NoCs. This notation helps in understanding and formulating routing algorithms for NoCs and in validating new algorithms as deadlock-free. The approach can be extended to create routing algorithms for other interconnection topologies as well. The IRN explores the rules of movement in one dimension and is suitable for an ‘X-Y’ based routing algorithm in which a packet first traverses the ‘X’ and then the ‘Y’ dimension to reach its destination. The IRN has two parts, the IRN Graph and the IRN Map. The IRN Graph uses arrows to show the direction of movement from a source node to the other nodes, which are the destination nodes. The IRN Map is based on the IRN Graph and displays the direction that a packet should take from the current node to get closer to the destination.

Fig.1 shows the IRN for a 5×5 mesh topology with the ‘X-Y’ routing algorithm. As can be seen in Fig.1.a, the first row of the IRN Graph, which is marked with the adjacent moves, displays a row of the network to reveal the topology. The following rows use arrows to display the direction of movement from a node to the other nodes, which are the destination nodes. In the original IRN Graph, only directions of movement for destinations more than one hop away are considered. In Fig.1.a, we have included all the directions for clarity. Fig.1.b displays the IRN Map, which is based on the IRN Graph of Fig.1.a. Each row number represents a current node and each column number represents a destination node. Each cell contains the direction that a packet should take from the current node to get closer to the destination. The ‘+’ and ‘-’ signs show movement in the positive and negative directions of the dimension respectively. The ‘+’ and ‘-’ signs in circles represent movements that are unchangeable, because it is not reasonable to select the opposite direction to reach the destination. The ‘+’ and ‘-’ signs without circles are selectable and can be chosen to create different routing algorithms, some of which are deadlock-free while others can generate deadlock. To find a routing algorithm using the IRN that is deadlock-free and optimal, the following rules should be considered [3]:

There should not be more than one sign change in each column to avoid livelock. Likewise, there must not be more than one sign change in each row.

It is better to have an equal number of ‘+’s and ‘-’s in the selectable area and in each row to achieve a more optimal routing.

There should be one row with all selectable ‘-’ and another one with ‘+’ movements for the algorithm to be deadlock-free [3].
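The first rule can be checked mechanically. The Python sketch below counts sign changes along every row and column of a map. It does not reproduce the map of Fig.1: the example map is an assumed one-dimensional map for five nodes under plain ‘X’ routing (‘+’ when the destination index is larger than the current index, ‘-’ when it is smaller), and the names build_x_map, sign_changes and obeys_sign_rule are hypothetical.

```python
def build_x_map(n):
    """Assumed 1D map for plain 'X' routing.

    Rows are current nodes, columns are destination nodes.
    Diagonal cells (current == destination) are left empty (None).
    """
    return [['+' if dst > cur else '-' if dst < cur else None
             for dst in range(n)]
            for cur in range(n)]


def sign_changes(cells):
    """Count '+'/'-' alternations in a row or column, skipping empty cells."""
    signs = [c for c in cells if c is not None]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def obeys_sign_rule(irn_map):
    """True if no row and no column has more than one sign change."""
    rows = [list(row) for row in irn_map]
    cols = [list(col) for col in zip(*irn_map)]
    return all(sign_changes(line) <= 1 for line in rows + cols)


mesh_map = build_x_map(5)
print(obeys_sign_rule(mesh_map))
```

The check covers only the sign-change rule; the balance of ‘+’ and ‘-’ signs and the all-‘+’ and all-‘-’ rows would need separate tests.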

a) IRN Graph b) IRN Map

Fig.1. The IRN for a 5×5 mesh using the ‘X-Y’ routing algorithm

Fig.2. The IRN Map for TRANC in one dimension of an n×n torus (n=4 to 8) [3]

The IRN has been employed to propose a routing algorithm called Torus Routing Algorithm for NoC (TRANC) [3], which is a deadlock-free deterministic routing algorithm for torus NoCs that uses only one virtual channel per physical channel. Fig.2 presents the TRANC IRN Map for one dimension of the 4×4 to 8×8 torus networks. The IRN Map for a ring with n nodes is generated by adding a row and a column to the IRN Map of the ring with n-1 nodes. For the newly added row, the rightmost two cells should be filled with ‘-’ signs and the others with ‘+’ signs; for the newly added column, the two lowest cells must be filled with ‘+’ signs and the others with ‘-’ signs. A loop between movements in the positive or negative direction causes deadlock, and TRANC is deadlock-free because it breaks such loops. For example, in the 8×8 torus of Fig.2 there is no positive movement of more than one hop from node 5 to other nodes, which breaks the positive loop in the network. In addition, there is no negative movement of more than one hop from node 6 to other nodes, which breaks the negative loop in the network.

4. Proposed Routing Algorithm

4.1. Algorithm description

Fig.3 shows a ring topology with 16 nodes in one dimension, which is ‘X’ in this case. This topology is the basic component of the torus topology. In this topology, the border nodes, which are the first and the last nodes (i.e. nodes R0 and R15), are connected by a wrap-around link. When a node sends a packet to another node, the routing algorithm decides, according to the destination address, whether to send the packet to the left (west) or right (east) port of the router.

Fig.3. Ring topology with 16 nodes

Fig.4. An inner-ring with 9 nodes

Fig.4 shows an inner-ring with 9 nodes. In this inner-ring, an extra link, which is ‘i’, connects left (Rs) and right (Rs+8) border nodes. There are two different types of nodes in an inner-ring: The nodes inside the ring and border nodes. Consequently, a routing algorithm for the inner-ring topology should consider the two different types of nodes.

Fig.5. The IRN for a 4×4 inner-ring (1D XTRANC)

The IRN has been employed to derive the routing algorithm for the inner-ring topology, called 1D XTRANC. Fig.5 shows the IRN Graph and Map for an inner-ring with four nodes which starts at Xs. As can be seen in this figure, there are two main differences between the IRN Graph and Map for the inner-torus and those for the torus: the node numbers are shifted by an offset, which is the left border of the inner-torus, and direction ‘i’ is added to the IRN Graph and Map. The ‘i’ is a bidirectional link between the left and right borders of the inner-torus. The ‘i’ direction is considered the negative direction when the left border node is the source and the right border node is the destination, and the positive direction when the right border node is the source and the left border node is the destination. In this scheme, there is no more than one sign change in any column or row, which avoids livelock. In addition, there is no loop between movements in the positive or negative direction, which would cause deadlock. As can be seen in Fig.5, the algorithm is deadlock-free because there is no positive movement of more than one hop from node Xs+1 to other nodes, which breaks the positive loop in the network, and there is no negative movement of more than one hop from node Xs+2 to other nodes, which breaks the negative loop.

Fig.6 presents the IRN Map for the inner-ring topology with offset Xs for radix n=4 to n=8. The IRN Map for a radix n ring with offset Xs is generated by adding a row and a column to the IRN Map of the radix n-1 ring with offset Xs. For the newly added row, the rightmost two cells should be filled with ‘i/-’ and the others with ‘+’ signs. The ‘i/-’ means that ‘i’ should be selected when this row is the first row and ‘-’ must be selected when it is not. For the newly added column, the two lowest cells must be filled with ‘i’ and then ‘+’, and the others with ‘-’ signs. In the 8×8 inner-torus of Fig.6, there is no positive movement of more than one hop from node Xs+5 to other nodes, which breaks the positive loop in the network. In addition, there is no negative movement of more than one hop from node Xs+6 to other nodes, which breaks the negative loop. This shows that XTRANC is a deadlock-free routing algorithm according to the proof proposed in [3] and summarized in Section 3.

Fig.6. The IRN Map for the inner-ring topology (1D XTRANC, n=4 to 8)

4.2. Operational Examples

We assume that there are no nested or overlapping inner-rings in an inner-ring topology and that each node can use only one extra link in this dimension. Under these assumptions, it is possible to add one or more inner-rings to the topology. In this scenario, the routing algorithm is divided into more than one segment. Each segment uses either XTRANC or the ‘X’ routing algorithm, both of which are deadlock-free. The routing algorithm used to connect the different segments is ‘X’, which is also deadlock-free. The following examples describe this concept. Fig.7 and Fig.8 show two examples of the inner-ring topology. As can be seen in these figures, it is possible to have only inner-rings or regular nodes in the topology.

Fig.7. An inner-ring topology with 3 segments

Fig.8. An inner-ring topology with 4 segments

Fig.7 displays an inner-ring topology which has been divided into 3 segments: an inner-ring with 9 nodes, 4 regular nodes outside the left border and 3 regular nodes outside the right border. There are three routing domains: segment 1 and segment 3 use the ‘X’ routing algorithm and segment 2 employs XTRANC. We study all the different communications among the segments:

1. The source node is in segment 1 and the destination is in segment 3: As an example, consider node R1 as the source and node R15 as the destination. First, segment 1 uses the ‘X’ routing algorithm to send the packet from node R1 to node R2 and then to node R3, which is the right border of this segment. Then, node R3 passes the packet to its right neighbour, node R4, which is the left border of segment 2, via the ‘X’ routing algorithm. After that, node R4 sends the packet to node R12, which is the right border of segment 2, through the extra link (i). Then, the right border of segment 2 sends the packet to its right neighbour, node R13, which is the left border of segment 3, using the ‘X’ routing algorithm. Finally, node R13 uses the ‘X’ routing algorithm to send the packet via node R14 to its destination in this segment, node R15. A similar procedure is used when the source node is in segment 3 and the destination is in segment 1.

2. The source node is in segment 1 and the destination is in segment 2: As an example, consider node R1 as the source and node R7 or R10 as the destination. First, segment 1 uses the ‘X’ routing algorithm to send the packet from node R1 to node R3, which is the right border of this segment. Then node R3 passes the packet to its right neighbour, node R4, which is the left border of segment 2, using the ‘X’ routing algorithm. At this step, the packet is inside segment 2 and node R4 sends the packet to the destination according to the IRN Map for the inner-ring topology. When the destination is node R7, node R4 sends the packet through nodes R5 and R6; when the destination is node R10, node R4 uses the extra link ‘i’ and the packet travels through nodes R12 and R11. A similar procedure is used when the source node is in segment 3 and the destination is in segment 2.

3. The source node is in segment 2 and the destination is in segment 1 or segment 3: As an example, consider node R7 as the source and node R1 as the destination. First, segment 2 uses the IRN Map for the inner-ring topology to send the packet from node R7 via nodes R6 and R5 to node R4, which is the left border of this segment. Then, node R4 passes the packet to its left neighbour, node R3, which is the right border of segment 1, using the ‘X’ routing algorithm. Finally, node R3 uses the ‘X’ routing algorithm to send the packet through node R2 to the destination node R1.
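The hop sequences in the three cases above can be cross-checked with a small Python sketch. It is not the XTRANC algorithm or its IRN Map: it runs a plain breadth-first search over the one-dimensional layout described for Fig.7 (nodes R0 to R15 with the extra link ‘i’ between R4 and R12) and uses shortest paths as a stand-in. The function names are hypothetical.

```python
from collections import deque


def build_fig7_links(n=16, left=4, right=12):
    """Adjacency for the 1D layout of Fig.7: neighbour links plus extra link 'i'."""
    adj = {r: set() for r in range(n)}
    for r in range(n - 1):
        adj[r].add(r + 1)
        adj[r + 1].add(r)
    adj[left].add(right)   # extra link 'i' between the inner-ring borders
    adj[right].add(left)
    return adj


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the node list from src to dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(adj[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = dst
    while node is not None:       # walk back from dst to src
        path.append(node)
        node = parent[node]
    return path[::-1]


adj = build_fig7_links()
for src, dst in [(1, 15), (1, 7), (1, 10), (7, 1)]:
    print(f"R{src} -> R{dst}:", shortest_path(adj, src, dst))
```

For these four source and destination pairs the shortest path is unique, so the search visits the same nodes as the worked examples. In general XTRANC need not follow a shortest path, because some moves are excluded to keep the algorithm deadlock-free.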

Fig.8 has 4 segments: three inner-rings and one segment with 2 regular nodes. The segment with regular nodes employs the ‘X’ routing algorithm and the inner-ring segments use XTRANC. The connections between the different segments use the ‘X’ routing algorithm as well. For example, consider node R1 as the source and node R13 as the destination. First, node R1 uses the IRN Map for the inner-ring topology and passes the packet via node R0 to node R3, which is the right border of segment 1. Then, node R3 sends the packet to its right neighbour, node R4 in segment 2, using the ‘X’ routing algorithm. After that, node R4, which is the left border of inner-ring 2 in segment 2, sends the packet to the right border of this segment, node R8, via extra link i2. Node R8 then sends the packet to its right neighbour, node R9, which is the left border of segment 3, using the ‘X’ routing algorithm. Node R9 sends the packet to node R10, the right border of segment 3, using the ‘X’ routing algorithm. After that, node R10 employs the ‘X’ routing algorithm to send the packet to its right neighbour, node R11, which is the left border of segment 4. Finally, node R11 uses the IRN Map for the inner-ring topology to send the packet via node R12 to the destination node in this segment, R13.

4.3. Implementing the 1D XTRANC

Two parameters, Ring and Node, are used for router initialisation in the network to implement the routing algorithm. If the node is a regular node, both Ring and Node are set to zero. Ring is set to 1 when the node is inside the inner-ring and Node is set to 1 when the node is on the border of the inner-ring. The regular nodes, which are not in an inner-ring, employ the ‘X’ routing algorithm. To implement the IRN Map for an inner-ring, we consider the location of the current node in the inner-ring, which is either a) an inside node or b) a border node. Fig.9 and Fig.10 show the routing algorithm pseudo code for the inside nodes and the border nodes of the inner-rings respectively. When the source node is inside the inner-ring and the destination node is outside it, the packet is routed to the inner-ring border on the same side as the destination node. When the source and destination nodes are both in the inner-ring, the packet is routed according to the IRN Map for the inner-ring topology.
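The initialisation and the border-selection rule can be illustrated with a short Python sketch. It is not the pseudo code of Fig.9 or Fig.10. It assumes that a border node counts as part of the inner-ring and therefore has both Ring=1 and Node=1, which the text does not state explicitly. The function names and the (left, right) index-pair representation are hypothetical.

```python
def init_parameters(n, inner_rings):
    """Return {node: (Ring, Node)} for a 1D row of n routers.

    inner_rings is a list of (left_border, right_border) index pairs.
    Assumption: border nodes belong to the inner-ring, so they get
    Ring=1 and Node=1; inside nodes get Ring=1 and Node=0.
    """
    params = {r: (0, 0) for r in range(n)}      # regular nodes
    for left, right in inner_rings:
        for r in range(left, right + 1):
            is_border = 1 if r in (left, right) else 0
            params[r] = (1, is_border)
    return params


def exit_border(dst, left, right):
    """Border on the same side as a destination outside [left, right]."""
    if left <= dst <= right:
        raise ValueError("destination is inside the inner-ring")
    return left if dst < left else right


params = init_parameters(16, [(4, 12)])         # layout described for Fig.7
print(params[2], params[4], params[7], params[12])
print(exit_border(1, 4, 12), exit_border(15, 4, 12))
```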

Fig.9. Pseudo code for the nodes inside the inner-rings

Fig.10. Pseudo code for the border nodes of the inner-rings

The inner-ring topology is the basic topology component also employed in 2D (i.e. ‘X-Y’). As in 1D, XTRANC for 2D networks is deadlock-free and livelock-free. In the 2D scenario, a packet first travels towards its destination along the X axis for the X dimension and then along the Y axis for the Y dimension. When traversing a dimension, a packet is allowed to overshoot its destination using XTRANC and backtrack in the same dimension. By enabling turns in this way, 2D XTRANC prevents any cyclic dependency among messages in reserving and using network channels. The proof is the same as for the restricted turn model in [17]. The XTRANC pseudo code for 2D networks is the same as for 1D, and the routing algorithm should consider first the ‘X’ and then the ‘Y’ direction. XTRANC for the Y direction can be obtained by changing ‘X-’ to ‘Y-’, ‘X+’ to ‘Y+’ and ‘i’ to ‘j’.
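The dimension-order wrapper described above can be sketched in Python as follows. The one-dimensional decision function is passed in as a parameter; the default x_step is only a placeholder that implements plain ‘X’ routing for a regular node, not the XTRANC decision of Fig.9 and Fig.10. All names are hypothetical, and a real router would also pass the inner-ring borders of each dimension to the 1D function.

```python
def x_step(cur, dst):
    """Placeholder 1D decision for a regular node: plain 'X' routing."""
    if dst > cur:
        return '+'
    if dst < cur:
        return '-'
    return None   # already aligned in this dimension


def route_2d(cur, dst, step_1d=x_step):
    """Output port for a packet at cur=(x, y) heading to dst=(x, y).

    step_1d returns '+', '-', 'i' (extra link) or None for one dimension.
    """
    move = step_1d(cur[0], dst[0])
    if move is not None:                    # X dimension first
        return 'i' if move == 'i' else 'X' + move
    move = step_1d(cur[1], dst[1])
    if move is not None:                    # then Y, with 'i' renamed to 'j'
        return 'j' if move == 'i' else 'Y' + move
    return 'local'                          # packet has arrived


print(route_2d((0, 0), (2, 3)), route_2d((2, 0), (2, 3)), route_2d((2, 3), (2, 3)))
```

Reusing one 1D function for both dimensions mirrors the renaming of ‘X-’, ‘X+’ and ‘i’ to ‘Y-’, ‘Y+’ and ‘j’.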

5. Performance Evaluation Platform

The System-on-Chip Wire (SoCWire) [18, 19], which supports dynamic partial reconfiguration, has been employed in this work to build the networks for evaluation with different topologies. We have modified the SoCWire Switch, adding logical addressing, to create a low-overhead router called the SoCWire Router [20] for large regular networks with many nodes. To verify and evaluate the capabilities of the networks, we have designed and implemented an on-chip NoC performance evaluation platform [20] on the FPGA and examined partial reconfiguration in a physical prototype. The platform is built around the Leon3 SoC available in the GRLIB IP Library [21]. The Leon3 is a SPARC-compatible soft-core processor developed by Aeroflex Gaisler that interfaces to the AMBA bus architecture. The IP blocks for the AMBA bus, Leon3 processor and other SoC peripherals are available in the GRLIB IP library. An additional IP block added to the GRLIB library is the reconfiguration controller [22], which loads new bitstreams stored in external DDR memory into the tiles without host intervention via the ICAP (Internal Configuration Access Port) hardware unit, which allows direct access to the device fabric. This external memory provides a good trade-off between memory size, transfer overheads and on-chip resource utilisation.

Fig.11. The performance evaluation platform for SoCWire networks

Fig.11 presents the architecture of this performance evaluation platform. We have considered the SoCWire Network as the dynamic part while the rest of the system is statically configured. This allows the communication interconnect to be mapped as a single block to a centralized area of the device which is connected to the evaluation platform. This approach works well with the current partial reconfiguration design flows, since it makes it possible to use the available FPGA resources optimally. If the number of local or communication ports in a router changes, the wiring infrastructure changes as well, and this can be achieved by letting the FPGA place-and-route (P&R) tools manage the resources in the assigned communication area without imposing excessive constraints.

Dynamic Partial Reconfiguration (DPR) is useful for systems with multiple functions that can time-share the same FPGA resources. We have used the DPR technique to create the dynamic part of the system while the rest of the evaluation platform is statically configured. The resulting device layout is shown in Fig.12, with the area occupied by the communication network clearly marked.

Fig.12. The resulting device layout

In this work we consider a single rectangular area that implements the network, with the processing elements located outside this area. This method is efficient for small to medium networks. It is possible to use partitioning methods with DPR to include multiple area groups and non-rectangular PR regions [23], which are suitable for larger networks. This is part of our future work.

Our performance evaluation system is based on different traffic types representing synthetic and realistic application traffic patterns. Performance is analyzed in terms of throughput and latency after implementing the system in the target board running at a normalized frequency of 100 MHz. The throughput shows the efficiency of delivering packets in the network and depends on topology and routing policy which is ‘X-Y’ in all the cases to be able to perform a fair comparison. The time required for traversing the network is referred to as the latency of a network. The average latency is the mean of the latencies of all received packets in the topology [17].
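The average latency defined above can be written as a short Python sketch. The throughput function is an assumed definition (delivered flits per cycle over the observed window), since the text does not give a formula. The packet records, as (send_time, receive_time) pairs in clock cycles, are hypothetical sample values and not measurements from the platform.

```python
def average_latency(packets):
    """Mean of (receive_time - send_time) over all received packets, in cycles."""
    latencies = [rx - tx for tx, rx in packets]
    return sum(latencies) / len(latencies)


def throughput(packets, flits_per_packet):
    """Assumed definition: delivered flits per cycle over the observed window."""
    window = max(rx for _, rx in packets) - min(tx for tx, _ in packets)
    return len(packets) * flits_per_packet / window


# Hypothetical (send_time, receive_time) records for three received packets.
sample = [(0, 20), (5, 35), (10, 40)]
print(average_latency(sample), throughput(sample, flits_per_packet=15))
```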

6. XTRANC for homogeneous networks

Partitioning a mesh network into an arbitrary number of smaller networks based on the inner-torus topology and the XTRANC routing algorithm could be a good solution to increase mesh network performance. For example, Fig.13 shows a typical 10×10 mesh network which has been partitioned into four inner-tori of 25 nodes each, where each inner-torus uses XTRANC as the routing algorithm for the 2D topology.

Fig.13. A typical 10×10 network which is partitioned to inner-torus networks

6.1. Evaluation of an inner-torus network

Fig.14 shows one inner-torus partition with 16 nodes, which has been connected to the performance evaluation platform introduced in the previous section and prototyped on a Virtex-5 LX110T device. We have also evaluated a 16-node mesh network and a 16-node full-mesh (interconnected mesh) network [11, 12] for comparison with the inner-torus. Fig.15 displays the full-mesh network [11, 12].

Fig.14. A 4×4 inner-torus partition

Fig.15. A 4×4 full-mesh network

6.1.1 Power and Maximum Achieved Frequency

Table 1 shows the maximum achieved frequency and the consumed power for the networks. XPower, the Xilinx power analysis tool, has been used to estimate the consumed power. To compare the networks we have considered the hierarchy power, which includes DCM power, BRAM power, signal power and logic power. As can be seen in this table, the maximum achieved frequency for the mesh network is 10.01% higher than for the inner-torus network, and for the inner-torus network it is 6.6% higher than for the full-mesh network. The maximum achieved frequencies are highly dependent on the mapped area. In addition, the inner-torus network consumes more power than the mesh network but reduces power by 38.1% compared to the full-mesh network.

TABLE 1. THE MAXIMUM ACHIEVED FREQUENCY AND THE CONSUMED POWER

                       Mesh     Inner-torus   Full-mesh
Max. Frequency (MHz)   119.9    107.77        100.65
Power (mW)             40.93    70.17         113.36

6.1.2 Performance

We have evaluated the networks with two different traffic patterns:

A. Uniform Traffic Pattern 1

In this scenario, each node sends 100 packets of 15 flits each to random targets chosen from the other nodes. A random timing interval separates consecutive packets and indicates the average time between data transfer requests. We have evaluated the networks with different ranges of random firing intervals. A lower value indicates a more heavily loaded NoC.
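A minimal sketch of such a traffic generator is given below. The function name, the `max_interval` parameter (the upper bound of the random firing interval, in cycles) and the fixed seed are assumptions made for illustration; the actual generator of the evaluation platform is not reproduced here.

```python
import random


def uniform_pattern_1(num_nodes=16, packets_per_node=100, flits_per_packet=15,
                      max_interval=50, seed=0):
    """Build a per-node send schedule of (send_cycle, destination, flits)."""
    rng = random.Random(seed)
    schedule = {}
    for src in range(num_nodes):
        others = [n for n in range(num_nodes) if n != src]
        cycle = 0
        sends = []
        for _ in range(packets_per_node):
            cycle += rng.randint(1, max_interval)  # random firing interval
            sends.append((cycle, rng.choice(others), flits_per_packet))
        schedule[src] = sends
    return schedule


schedule = uniform_pattern_1()
print(schedule[0][:3])  # first three sends of node 0
```

Lowering `max_interval` shortens the gaps between packets and therefore loads the network more heavily.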

Fig.16 displays the throughput and Fig.17 shows the average latency for the 4×4 inner-torus, full-mesh and mesh networks for this uniform traffic pattern. These figures reveal that the inner-torus network increases the throughput by up to 25.94% and reduces the average latency by up to 79.2% compared to the mesh network. Compared to the full-mesh network, the inner-torus network increases the throughput by up to 1.7% and reduces the average latency by up to 13.7%. This indicates that the inner-torus network achieves a performance comparable to the full-mesh despite using fewer resources and being more power efficient.

Fig.16. The networks throughput for the uniform traffic pattern 1

Fig.17. The networks average latency for the uniform traffic pattern 1

B. Uniform Traffic Pattern 2

In this case, each node sends between 1 and 100 packets of 15 flits each to a random target chosen from the other nodes. The traffic is bursty: there is no random timing interval between consecutive packets. Fig.18 displays the throughput and Fig.19 shows the average latency for the 4×4 inner-torus, full-mesh and mesh networks in this scenario. These figures reveal that, for the applied test, the inner-torus network increases the throughput by 23.4% to 59% and reduces the average latency by 28.2% to 40.7% compared to the mesh network. In addition, the inner-torus network increases the throughput by 6.1% to 63.9% and reduces the average latency by 14.5% to 22.2% compared to the full-mesh network in this scenario.

Fig.18. The networks throughput for the uniform traffic pattern 2

Fig.19. The networks average latency for the uniform traffic pattern 2
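The bursty pattern can be sketched in the same spirit. The sketch assumes one reading of the description above: each node picks a single random target and sends it one back-to-back burst of 1 to 100 packets. The function name and the seed are hypothetical.

```python
import random


def uniform_pattern_2(num_nodes=16, max_packets=100, flits_per_packet=15, seed=0):
    """Return {source: (destination, packet_count, total_flits)}, one burst per node."""
    rng = random.Random(seed)
    bursts = {}
    for src in range(num_nodes):
        dst = rng.choice([n for n in range(num_nodes) if n != src])
        count = rng.randint(1, max_packets)  # burst length, no gap between packets
        bursts[src] = (dst, count, count * flits_per_packet)
    return bursts


bursts = uniform_pattern_2()
print(bursts[0])
```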

6.1.3 Complexity

Table 2 summarizes the complexity of the networks. As can be seen in the table, the configuration with the inner-torus network consumes 44% more LUTs than the traditional mesh network. The additional logic is required by the extra ports in the inner routers. The full-mesh is, as expected, even more complex and requires 53.8% more LUTs than the inner-torus network.

TABLE 2. COMPLEXITY FOR THE 4×4 INNER-TORUS, FULL-MESH AND THE MESH NETWORKS

Networks          Mesh    Inner-torus   Full-mesh
Slice Registers   10864   13029         19294
Slice LUTs        16828   24237         37285
Block RAM/FIFO    48      56            80
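The LUT overheads quoted for Table 2 follow directly from the Slice LUTs row. A short check is sketched below; the helper name `percent_more` is ours.

```python
# Slice LUT counts taken from Table 2.
SLICE_LUTS = {"mesh": 16828, "inner-torus": 24237, "full-mesh": 37285}


def percent_more(value, baseline):
    """Percentage by which value exceeds baseline."""
    return (value - baseline) / baseline * 100


inner_vs_mesh = percent_more(SLICE_LUTS["inner-torus"], SLICE_LUTS["mesh"])
full_vs_inner = percent_more(SLICE_LUTS["full-mesh"], SLICE_LUTS["inner-torus"])
print(f"inner-torus vs mesh: {inner_vs_mesh:.1f}% more LUTs")
print(f"full-mesh vs inner-torus: {full_vs_inner:.1f}% more LUTs")
```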

7. XTRANC for irregular networks

It is possible to add extra link(s) only to the areas where the network is congested. The extra links are added to the mesh network in the ‘X’ and ‘Y’ directions between nodes that suffer from long latencies. The key point is to detect the area that requires extra link(s) and to replace the network, or a part of it, with a modified version that has the extra link(s). Note that adding one extra link in the current setup requires a reconfiguration of the network area shown in Fig.12.

7.1. Possible mechanism to detect and add extra link(s) to the network at run-time

This section introduces a possible mechanism to detect and add extra link(s) at run-time in commercial FPGAs using the state-of-the-art Dynamic Partial Reconfiguration (DPR) technique. Previous work [24] presents a distributed stochastic run-time strategy that incrementally maps an application onto a large run-time reconfigurable multicore platform. This method uses a mesh network as its communication infrastructure. The proposed task mapping employs specific tasks named task managers (TMs) for mapping applications. Each application has its own TM, which monitors the application's tasks during execution to perform fault detection, task migration and energy management. The TM could map tasks on demand and could also map more than one application to reduce the overhead caused by the TM, but for the sake of simplicity and clarity this paper considers one task manager per application.

In this scenario, when a network has more than one task manager (TM), each TM maps its own tasks without being aware of the other TMs, and some applications have unpredictable traffic patterns. The communication latency is therefore unpredictable due to congestion. In addition, some nodes need to receive data from other nodes within a specific latency, but because of the congestion the data arrives later. A novel method is therefore needed to control the communication latency between nodes. In this method, when a node sends a packet to another node, the sending time stamp is added to the header of the packet. When a node receives a packet, it extracts the sending time, compares it to the current time and makes one of two decisions. If the latency is less than the expected and acceptable latency, the node accepts the packet. If the latency is more than the expected latency, the node sends a message to its TM asking for help. This message contains the sender and receiver information. When a TM receives the help message from a node, it has five tasks to do:

1. The TM orders its nodes, and asks the other TMs to inform their nodes, that a reconfiguration is in process. The nodes pause their communications immediately when they receive the command.

2. The TM is aware of the different configuration bitstreams stored in external DDR memory. It calculates which configuration with extra link(s) should be selected to improve the latency.

3. The TM waits for a specific number of clock cycles so that the packets already inside the network can be delivered. This waiting time can vary depending on the network size and traffic pattern.

4. The TM orders the replacement of the network, or part of the network, with a mesh network with extra link(s) using the DPR technique.

5. The TM sends a message to its related nodes and the other TMs to inform them that the network reconfiguration is done and that communications can restart.

All of the above happens at the Network Interface (NI) level and in the TM, so the IPs do not incur any overhead from this process.
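The mechanism described above, a receiver-side latency check followed by the five TM steps, can be sketched as follows. All class, field and configuration names are hypothetical. The lookup table that stands in for the TM's choice of bitstream, the handling of the case where no configuration is stored, and the treatment of a latency exactly equal to the limit as acceptable are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    sent_cycle: int  # time stamp written into the header by the sender


@dataclass(frozen=True)
class HelpMessage:
    sender: str
    receiver: str


def check_latency(packet, now_cycle, max_latency):
    """Receiver-side check: return a HelpMessage if the packet arrived too late."""
    latency = now_cycle - packet.sent_cycle
    if latency <= max_latency:
        return None  # acceptable latency, the packet is simply accepted
    return HelpMessage(sender=packet.src, receiver=packet.dst)


@dataclass
class TaskManager:
    # Maps a (sender, receiver) pair to a stored configuration name (assumed).
    bitstreams: dict
    drain_cycles: int
    log: list = field(default_factory=list)

    def handle_help(self, msg):
        """Walk through the five steps in order, recording each in self.log."""
        self.log.append("pause nodes and notify other TMs")
        config = self.bitstreams.get((msg.sender, msg.receiver))
        if config is None:  # not covered by the text; assumed fallback
            self.log.append("no suitable configuration, resume")
            return None
        self.log.append(f"selected {config}")
        self.log.append(f"wait {self.drain_cycles} cycles for in-flight packets")
        self.log.append(f"reconfigure network with {config} via DPR")
        self.log.append("notify nodes and other TMs, resume")
        return config


tm = TaskManager(bitstreams={("f", "g"): "mesh_1_extra_link"}, drain_cycles=200)
msg = check_latency(Packet("f", "g", sent_cycle=100), now_cycle=190, max_latency=60)
chosen = tm.handle_help(msg) if msg is not None else None
print(chosen, tm.log)
```

Pausing the nodes before waiting for in-flight packets matters: it ensures that no new packets enter the network while the old ones drain, so the reconfiguration starts from an empty network.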

7.2. Irregular network performance analysis

In this section, we have considered the whole network, or a part of it, as the dynamic part that needs to be reconfigured, while the rest of the system is statically configured. The experiments in 7.2.1 to 7.2.3 demonstrate that XTRANC is a routing algorithm capable of reducing the communication latency between specific nodes. We have considered three cases as a proof of concept:

7.2.1 Case 1: 2 TMs with 1 task each and 1 extra link

We have considered a network area where two different tasks have been mapped at run-time. Fig.20 shows these two simple task graphs and Fig.21 shows the tasks mapped on the mesh network. We have connected the sub-network to the performance evaluation platform, treated the evaluation platform as the static part and the network as the dynamic part, and implemented the system in the FPGA. We have assumed that all communications between nodes meet their latency limits except the communication between nodes ‘f’ and ‘g’. We have used PR to replace the mesh network with the mesh network with an extra link shown in Fig.22. Table 3 shows the complexity and maximum achieved frequency of the mesh and of the mesh network with an extra link in this case. This table shows that adding an extra link requires 7.5% more LUTs. In addition, as mentioned before, the maximum achieved frequency is highly dependent on the area selected to map the networks. In this case, adding an extra link has no effect on the maximum achieved frequency.

Fig.20. Two simple task graphs.

Fig.21. Run-time mapped applications on mesh network

Fig.22. Run-time mapped applications on mesh network with one extra link

TABLE 3. OCCUPIED AREA AND MAXIMUM FREQUENCY OF THE NETWORKS

Networks              Mesh      Mesh-4x4-1-extra-link
Slice Registers       10864     11168
Slice LUTs            16828     18094
Block RAM/FIFOs       48        49
Max frequency (MHz)   108.719   108.684

Fig.23. The latency between nodes ‘f’ and ‘g’ with mesh and mesh with 1 extra link networks

Fig.24. The average network latency between nodes for the mesh and mesh with 1 extra link networks

As can be seen in Fig.23, adding the extra link decreases the average latency between nodes ‘f’ and ‘g’ by 30.42% when each transaction contains a packet with 16 flits. Fig.24 shows the average latency for the networks: adding the extra link reduces the average network latency by 8.58%.

7.2.2. Case 2: 2 TMs with 1 task each and 2 extra links

We have considered a network area with the same two tasks as in the previous case but mapped differently, as can be seen in Fig.25. We have connected the sub-network to the performance evaluation platform, treated the evaluation platform as the static part and the network as the dynamic part, and implemented the system on the FPGA. We have assumed that all communications between nodes meet their latency limits except the communication between nodes ‘f’ and ‘g’. We have used PR to replace the mesh network with the mesh network with two extra links shown in Fig.26. Table 4 shows the complexity and maximum achieved frequency for the networks. This table reveals that adding two extra links in this case increases the consumed LUTs by 16.42% in comparison to the mesh network. As can be seen in Fig.27, adding the extra links decreases the average latency between nodes ‘f’ and ‘g’ by 38.67% when each transaction contains a packet with 16 flits. Fig.28 shows the average latency for the networks. As can be seen in this figure, adding the extra links reduces the average network latency.

Fig.25. Run-time mapped applications on mesh network

Fig.26. Run-time mapped applications on mesh network with two extra links

TABLE 4. COMPLEXITY AND MAXIMUM FREQUENCY OF THE NETWORKS

Networks              Mesh      Mesh_2extra_links
Slice Registers       10864     11438
Slice LUTs            16828     19591
Block RAM/FIFO        48        50
Max frequency (MHz)   103.842   100.827

Fig.27. The latency between nodes ‘f’ and ‘g’ with mesh and mesh with 2 extra links network

Fig.28. The average network latency between nodes for the mesh and mesh with 2 extra links network

7.2.3 Case 3: 1 static partition, 1 dynamic partition, 1 TM with 1 task for each partition, 1 extra link

We have considered two network areas with 8 and 6 nodes. Each area has one TM, and each TM is responsible for one task. We have connected the networks to the performance evaluation platform, treating the evaluation platform and the 6-node sub-network as the static part and the 8-node sub-network as the dynamic part. Fig.29 shows the task graphs (TGs) and Fig.30 displays how the tasks have been mapped to the two sub-networks. We have assumed that all communications between nodes meet their latency limits except the communication between nodes ‘g’ and ‘a’. We have used PR to replace the mesh network with mesh networks that have one and two extra links, shown in Fig.31 and Fig.32 respectively. Note that a direct link between ‘g’ and ‘a’ is not compatible with XTRANC.

Fig.29. Two task graphs.

Fig.30. Run-time mapped applications on mesh network.

Fig.31. Run-time mapped applications on mesh network with one extra link.

Fig.32. Run-time mapped applications on mesh network with two extra links.

Table 5 shows the complexity and maximum achieved frequency of the 8-node sub-networks. It reveals that adding one and two extra links requires 6.2% and 15.36% more LUTs, respectively, than the mesh network. Fig.33 shows the average latency between nodes ‘g’ and ‘a’. Adding one extra link reduces the average latency from node ‘g’ to node ‘a’ by 15.4%. Adding two extra links reduces the average latency from node ‘g’ to node ‘a’ by 17.2% and from node ‘a’ to node ‘g’ by 31.3%. Fig.34 displays the average network latency, which is reduced by adding one or two extra links.

TABLE 5. COMPLEXITY AND MAXIMUM FREQUENCY OF THE MESH AND THE NETWORKS

Networks              2X4_mesh   2X4_mesh_1-extralink   2X4_mesh_2-extralinks
Slice Registers       5875       5878                   6129
Slice LUTs            8947       9497                   10321
Block RAM/FIFO        26         27                     28
Max Frequency (MHz)   119.36     124.347                124.611

Fig.33. The latency between nodes ‘g’ and ‘a’ with mesh and mesh with 1 and 2 extra links networks

Fig.34. The average network latency between nodes for the mesh and mesh with 1 and 2 extra links networks

8. Conclusions

We have created the XTRANC algorithm as a suitable routing algorithm for dynamically reconfigurable NoCs formed by mesh networks extended with additional links. The experimental analysis reveals that XTRANC achieves better performance and remains deadlock-free despite the topology changes introduced by the additional links. XTRANC can be used to improve the use of limited hardware resources in FPGAs by adding links only to areas of traffic congestion and dynamically reconfiguring the links as requirements change.

The results indicate that the partial reconfiguration features available in modern FPGAs could be used to deploy different interconnects at run-time depending on the active application and design goal. The current work is based on a Xilinx Virtex-5 LX110T which, with 69K logic cells, does not offer enough resources to build larger systems. With newer FPGAs such as the latest Xilinx Virtex-7, which offer millions of logic cells, it will be possible to create very large communication networks with hundreds of processing elements and to study the scalability of this method in these cases. This is part of our future work. In addition, the IRN produces both deterministic and adaptive routing algorithms. In this work we have employed the deterministic XTRANC because of its inherent simplicity, being based on the ‘X-Y’ algorithm. Future work will consider adaptive XTRANC and compare it with deterministic XTRANC and other adaptive algorithms.

References

[1] Ababei, C.: ‘Efficient Congestion-Oriented Custom Network-on-Chip Topology Synthesis’ Reconfigurable Computing and FPGAs (ReConFig), 2010 International Conference on, Cancun, Quintana Roo, Mexico, Dec. 2010, pp.352- 357.

[2] Duato, J., Yalamanchili, S., and Ni, L.: ‘Interconnection Networks: An Engineering Approach’, Morgan Kaufmann, 2003.

[3] Rahmati, D., Sarbazi-Azad, H., Hessabi, S., and Eslami Kiasari, A.: ‘Power-efficient deterministic and adaptive routing in torus networks-on-chip,’ Microprocessors and Microsystems - Embedded Hardware Design, 2012, 36, (7), pp. 571-585.

[4] Palesi, M., and Daneshtalab, M. (Eds.): ‘Routing Algorithms in Networks-on-Chip’, Springer, 2013.

[5] Chiu, G.-M.: ‘The odd-even turn model for adaptive routing’, IEEE Trans. on Parallel and Distributed Systems, 2000, 11, (7), pp. 729-738.

[6] Hu, J., and Marculescu, R.: ‘DyAD - smart routing for networks-on-chip’, in Proc. Design Automation Conference, July 2004, pp. 260-263.

[7] Shafiee, A. M., Montazeri, M., and Nikdast, M.: ‘An Innovational Intermittent Routing Algorithm in Network-on-Chip’, International Conference on Computer Science and Engineering, France, September 2008.

[8] Li, M., Zeng, Q., Jone, W.: ‘DyXY - a proximity congestion-aware deadlock-free dynamic routing method for network on chip’, Design Automation Conference, 43rd ACM/IEEE, San Francisco, CA, USA, July 2006, pp. 849-852.

[9] Patooghy, A.; Miremadi, S.G.: ‘XYX: A Power & Performance Efficient Fault-Tolerant Routing Algorithm for Network on Chip,’ Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on, Weimar, Germany, Feb 2009, pp. 245-251.

[10] Nickray, M., Dehyadgari, M., Afzali-Kusha, A.: ‘Adaptive routing using context-aware agents for networks on chips’, Design and Test Workshop (IDT), 2009 4th International , Riyadh, Saudi Arabia, Nov. 2009, pp. 1-6.

[11] Choudhary, S. and Qureshi, S.: ‘A new NoC architecture based on partial interconnection of mesh networks’ in IEEE Symposium on Computers & Informatics (ISCI), Kuala Lumpur, 2011.

[12] Choudhary, S. and Qureshi, S.: ‘Performance Evaluation of Mesh-based NoCs: Implementation of a New Architecture and Routing Algorithm’, International Journal of Automation and Computing , 2012, 9, (4), pp. 403-413.

[13] Reshadi, M., Khademzadeh, A., Reza, A., and Bahmani, M.: ‘A Novel Mesh Architecture for On-Chip Networks’, ‘D & R Industry Articles,’ http://www.design-reuse.com/articles/23347/on-chipnetwork, accessed Sep. 2013.

[14] Fuks, H., and Lawniczak, A.T.: ‘Performance of data networks with random links’, Mathematics and Computers in Simulation, 1999, 51, (1-2), pp. 101-117.

[15] Ogras, U. Y., and Marculescu, R.: ‘It's a small world after all: Noc performance optimization via long-range link insertion’, IEEE Trans. on Very Large Scale Integration Systems, Special Section on Hardware/Software Codesign and System Synthesis, 2006, 14, (7), pp. 693-706.

[16] Jackson, C.; Hollis, S.J.: ‘Skip-links: A dynamically reconfiguring topology for energy-efficient NoCs’, System on Chip (SoC), 2010 International Symposium on , Tampere, Finland, Sep.2010 , pp. 49-54.

[17] Dally, W.J., and Towles, B.P.: ‘Principles and Practices of Interconnection Networks’, The Morgan Kaufmann Series in Computer Architecture and Design, Morgan Kaufmann, Burlington, 2004.

[18] ‘System-on_Chip Wire, IDA, 5 May 2009’, www.socwire.org, accessed Sep. 2013.

[19] Osterloh, B., Michalik, H., and Fiethe, B.: ‘SoCWire: A Robust and Fault Tolerant Network-on-Chip Approach for a Dynamic Reconfigurable System-on-Chip in FPGAs,’ in Architecture of Computing Systems - ARCS 2009. 2009, vol 5455, Delft, Netherlands: Springer Berlin / Heidelberg, pp. 50-59.

[20] Beldachi, A.F., Hosseinabady, M., and Nunez-Yanez, J.: ‘Configurable Router Design for Dynamically Reconfigurable Systems based on the SoCWire NoC’ International Journal of Reconfigurable and Embedded Systems (IJRES),2013, 2, (1).

[21] ‘Leon3 Processor/GRLIB’, http://www.gaisler.com, accessed Sep. 2013.

[22] Nabina, A., Nunez-Yanez, J.: ‘Dynamic Reconfiguration Optimisation with Streaming Data Decompression,’ FPL, 2010 International Conference on Field Programmable Logic and Applications, Milan, Italy, Sep. 2010, pp. 602-607.

[23] ‘Xilinx website’, http://www.xilinx.com/support/answers/25018.htm, accessed Sep. 2013.

[24] Hosseinabady, M., and Nunez-Yanez, J. L.: ‘Run-time stochastic task mapping on a large scale network-on-chip with dynamically reconfigurable tiles’, Computers & Digital Techniques, IET, 2012, 6, (1), pp. 1-11.
