White Paper

Implementing Traffic Managers in Stratix II Devices

February 2004, ver. 1.0

WP-STXIITRFC-1.0

Introduction

Bundling voice, video, and data services provides carriers with a steady revenue stream while reducing customer turnover. Delivering these services through a common infrastructure reduces a carrier’s operational expenditures. In addition, “future-proofing” networks with flexible solutions that enable the delivery of enhanced services down the road enables carriers to limit their long-term capital expenditures. These factors are the motivation behind the increasing focus on guaranteeing Quality of Service (QoS) through traffic management.

Altera’s Stratix™ II family continues the trend of using FPGAs for traffic managers because of the inherent flexibility of FPGAs and because the Stratix II architecture has been optimized for performing traffic management functions. In addition to the extensive internal memory and I/O pins, the Stratix II device offers substantial density and performance improvements. These improvements are attributed to both technology advancements and architectural optimizations.

This white paper discusses traffic management and the implementation of traffic management functions within Stratix II devices. This white paper also provides an analysis of several of these functions, including scheduling and queue management, and describes improvements within the Stratix II architecture that optimize these functions. Additionally, because of the importance of memory management in traffic management, this paper discusses memory and memory interfacing.

Traffic Management Background

Traffic management enables bandwidth management through the enforcement of service level agreements (SLAs). SLAs define the criteria a network must meet for a specified customer or service, including:

Guaranteed bandwidth, or throughput, including minimum, average, and peak guarantees of bandwidth availability.

Packet loss: the number or percentage of packets sent but not received, or received in error.

Latency: the end-to-end delay of packets.

Jitter: the delay variation between consecutive packets.


Figure 1 shows a typical line card. The packet processor classifies the ingress data traffic (data traveling from the line side toward the switch) and determines which port the data should exit. The packet processor also modifies the data header, adding the appropriate class. The traffic manager uses this header to enforce the SLA that defines the criteria that must be met for a specified customer or service. With egress traffic (data traveling from the switch to the line side), the traffic manager smoothes large spikes in traffic, allowing the overall network to run more efficiently.

A traffic manager in the data path (as shown in Figure 1) is considered to be in “flow-through” mode. An advantage of this mode is that it reduces the complexity of the packet processor by offloading packet buffering. A traffic manager outside the data path is considered to be in “lookaside” mode. In this mode, the packet processor communicates with the traffic manager through a lookaside interface and receives scheduling information, but the packet processor is also the interface to the backplane transceiver. In this mode the packet processor buffers the packets, and the traffic manager is responsible for maintaining the descriptor tables.

Figure 1. A Typical Line Card Block Diagram

Figure 2 shows a block diagram of a generic traffic manager. Not all traffic managers implement all the functions shown in the figure. Some of the functions in the diagram may also be implemented within the packet processor.

The data arriving into the traffic manager can be complete variable-length packets or fixed-length cells. Many traffic managers or packet processors segment packets into fixed-length cells, because switch fabrics can be optimized for switching fixed-size cells. The modified header of incoming data traffic allows traffic managers to prioritize and decide which packets should be dropped and retransmitted, when packets should be sent to the switch fabric, and how traffic should be shaped when sending it to the network.


Figure 2. Generic Traffic Manager Block Diagram

Computationally Intensive Functions: Scheduling

This section includes details on the implementation of several scheduling functions found in traffic managers and descriptions of the advantages gained from the Stratix II architecture.

A scheduler has four basic parameters:

The number of priority levels.

The type of service (work-conserving or nonwork-conserving).

The degree of aggregation.

The service order within a priority level.


If a scheduler supports priorities, it serves a packet from a priority level only if there are no packets waiting for service in an upper priority level. With such a scheme, connections that require QoS and are intolerant of delays can be serviced with higher priority than others. A potential problem with a priority scheme is that a user at a higher priority level may increase the delay and decrease the available bandwidth for connections at all lower priority levels. An extreme case of this is starvation, where the scheduler never serves a packet of a lower priority level because there is always something to send from a higher priority level.

In an integrated services network, at least three priority levels are desirable: a high priority level for urgent messages, usually for network control; a medium priority level for guaranteed-service traffic; and a low priority level for best-effort traffic. The virtual output queue (VOQ) structure can also be implemented outside the FPGA in SRAM, because queue overflow generally occurs in the input stage of the switch. In this case, scheduling must be fast, and the switch resource is not available.

There are two types of scheduling service: work-conserving and nonwork-conserving. A work-conserving scheduler is idle only when there is no packet awaiting service. In contrast, a nonwork-conserving scheduler can be idle even if it has packets to serve, so as to shape the outgoing traffic and reduce burstiness and delay jitter. The work-conserving discipline is more suitable for best-effort traffic (Internet Protocol (IP) traffic), while the nonwork-conserving discipline is better suited to guaranteed-service traffic (voice and video). New integrated network systems need schedulers that serve both types of traffic, and the flexibility of the Stratix II FPGA architecture allows the implementation of both types of schedulers.

The WRR Algorithm

The weighted round robin (WRR) scheme assigns different priorities to different queues. The selection policy chooses among the queues according to their priority, which is based on the SLA or the type of traffic. One way to implement this scheme is to maintain an urgency counter for each queue, where each urgency counter holds a value that represents the weighting of its queue. This section discusses the implementation of this algorithm in Stratix II devices.

To perform the selection, the WRR algorithm increments the urgency counters of all the active queues by their respective weights, or priority. The active queue with the highest value in the urgency counter is selected, and then this urgency counter is decremented by the sum of all the weights of the active queues. The algorithm for WRR is

1. Identify all the active queues.

2. Update the counters of all the active queues: Counti(active) = Counti(active) + Wi

3. Select the maximum count, Counti(active)max.

4. Normalize the winner: Counti(active)max = Counti(active)max - ΣWi(active)

where Counti is the urgency counter value, Wi is the weight, and i is the queue index.
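To make the four steps concrete, the following C sketch models one WRR scheduling decision in software; the queue count and data types are assumptions taken from the example below, not part of any Stratix II implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_QUEUES 128              /* 32 VOQs x 4 priorities (example) */

    /* One WRR scheduling decision over the active queues. Returns the
       selected queue index, or -1 if no queue is active. The count array
       persists between calls. */
    int wrr_select(int32_t count[NUM_QUEUES],
                   const int32_t weight[NUM_QUEUES],
                   const bool active[NUM_QUEUES])
    {
        int32_t weight_sum = 0;
        int best = -1;

        for (int i = 0; i < NUM_QUEUES; i++) {
            if (!active[i])
                continue;
            count[i] += weight[i];               /* step 2: update counters */
            weight_sum += weight[i];             /* running sum of weights  */
            if (best < 0 || count[i] > count[best])
                best = i;                        /* step 3: track maximum   */
        }
        if (best >= 0)
            count[best] -= weight_sum;           /* step 4: normalize winner */
        return best;
    }

Each call performs steps 1 through 4 once; step 3, finding the maximum, is the stage analyzed in detail next.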

This paper analyzes step 3, selecting the maximum urgency counter (Max Ci), in detail because it has the highest arithmetic complexity. In this example, the entire list of active urgency counters must be sorted and the maximum urgency counter selected.


This example assumes 32 VOQs with 4 priority levels each, producing 128 priority queues (32 VOQs x 4 priorities = 128 priority queues). To determine the length of the urgency counter, the queue ID and queue weights need to be computed. The queue ID for these 128 queues requires 7 bits (log2 128 = 7-bit Qid). In this example, each queue can have a weighting that ranges from 0 to 511, which requires the queue weights (Qw) to be 9 bits (log2 512 = 9-bit Qw). In addition, an extra bit should be added to the urgency counter to handle negative values. The size of the urgency counter in this example is therefore 9 + 7 + 1 bits, or 17 bits. Each scheduling decision requires the WRR scheduler to sort and select from the 128 different 17-bit urgency counters.

There are several implementation possibilities for sorting these urgency counters. This example shows an array architecture implementation. The array architecture for comparing the urgency counters uses a 128x17 matrix of 2-bit comparators, as shown in Figure 3. The matrix determines the maximum value of the 128 urgency counters. The horizontal rows in the matrix represent a bit slice of the 17-bit urgency counters. The vertical columns in the matrix represent the urgency counter values for each of the 128 urgency counters.

Each comparator takes three inputs: the urgency counter bit value, the bit slice result, and a disable signal. The cells in the first row and column are all enabled, allowing each to participate in the comparison. Once a cell is disabled, the disable signal propagates through the rest of the cells of that urgency counter (that is, the urgency counter is removed from the comparison). The array architecture performs the comparisons for all 128 counters in a bit slice in parallel, determining the maximum bit of the enabled queues for that bit slice. Then the next bit slice begins comparison. The maximum urgency counter is obtained after computing the maximum bit of the enabled queues at each of the 17 bit slices.
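For illustration, this bit-slice elimination can be modeled in C as follows. This is a behavioral sketch of the array, not a gate-level description; the counter count and width are taken from the example above.

    #include <stdint.h>

    #define N 128          /* number of urgency counters        */
    #define K 17           /* counter width in bits (9 + 7 + 1) */

    /* Behavioral model of the comparator array: scan bit slices from the
       MSB down, disabling any still-enabled counter whose bit is 0 while
       at least one enabled counter has a 1 in that slice. */
    int array_max(const uint32_t counter[N])
    {
        uint8_t disabled[N] = {0};

        for (int bit = K - 1; bit >= 0; bit--) {
            int slice_max = 0;                       /* OR of enabled bits */
            for (int i = 0; i < N; i++)
                if (!disabled[i] && ((counter[i] >> bit) & 1))
                    slice_max = 1;
            if (slice_max)                           /* drop the zeros     */
                for (int i = 0; i < N; i++)
                    if (!disabled[i] && !((counter[i] >> bit) & 1))
                        disabled[i] = 1;
        }
        for (int i = 0; i < N; i++)                  /* first survivor wins */
            if (!disabled[i])
                return i;
        return -1;                                   /* unreachable for N > 0 */
    }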


Figure 3. Array Architecture for Urgency Counter Implementation of WRR

(Figure content: an N-column by k-row array of comparator cells, one column per urgency counter, Counter1 through CounterN, and one row per bit slice, from bit k-1 down to bit 0; each cell receives a disable signal, and each bit slice produces a maximum-bit output M.)

As shown in Figure 3, the comparators are essentially three-input functions. Designers can configure the Stratix II adaptive logic module (ALM) to implement two look-up tables (LUTs) with the same or a different number of inputs. When implementing a function of three or fewer variables in a traditional four-input LUT structure, the unused portion of the LUT is wasted. In Stratix II devices, the unused portion of the LUT can be reused to implement a separate function with up to five independent inputs (see Figure 4). This provides greater efficiency by allowing comparator functions to be combined with other functions within the same LUT.


Figure 4. Stratix II Device LUT Packing

In addition, the Stratix II architecture is optimized for wide fan-in functions. For example, designers can implement the 128-input AND gates required in the array architecture in 27 Stratix II LEs with three levels of logic, as opposed to 53 LEs with four levels of logic using purely four-input LUTs.

The WRR algorithm described above also requires computing the sum of all the individual weights. This can be done with a pipelined arithmetic addition scheme that uses the Wi values and the queue activity status to calculate ΣWi(active). In this example, for N = 128 with a 16-stage pipeline, each adder must add eight weights that are 9 bits wide. The Stratix II architecture reduces the logic resources and summation stages by allowing the use of three-input adders inside an ALM (see Figure 5).

Figure 5. Three-Input Adder Implementation
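A software model of such a summation is sketched below. Note that it reduces three values per node, mirroring the ALM's three-input adders, rather than using the 16-stage, eight-weight decomposition described above; the widths and queue count are this example's assumptions.

    #include <stdint.h>

    #define N 128

    /* Sum the weights of the active queues by reducing three values per
       node, as a Stratix II ALM three-input adder would; each pass of the
       while loop corresponds to one pipeline stage in hardware. */
    uint32_t sum_active_weights(const uint16_t w[N], const uint8_t active[N])
    {
        uint32_t buf[N];
        int n = 0;

        for (int i = 0; i < N; i++)
            buf[n++] = active[i] ? w[i] : 0;   /* mask inactive queues */

        while (n > 1) {
            int m = 0;
            for (int i = 0; i < n; i += 3) {   /* 3:1 reduction per stage */
                uint32_t s = buf[i];
                if (i + 1 < n) s += buf[i + 1];
                if (i + 2 < n) s += buf[i + 2];
                buf[m++] = s;
            }
            n = m;
        }
        return buf[0];
    }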


The Memory Bottleneck: External Memory

While communication link speeds have grown in 4x increments approximately every two years, memory performance has only increased 10% per year. This has led to memory becoming a critical bottleneck in communication systems. Current traffic managers require large amounts of embedded memory as well as support for high-speed external memory sources.

The amount of external memory required is application dependent, but there are a few general guidelines. Because data is written into and read out of memory, memory throughput must be at least two times the line rate. If header information is added to data as it is processed, the throughput requirement increases to up to four times the line rate. The total size of memory in many cases is bounded by the round trip time (RTT) of the transmission control protocol (TCP). This is the average round trip time between active hosts, which can range from 200 to 300 ms. For example, a 10-Gbps (gigabits per second) interface requires two to three Gbits of memory.

In many cases, a segmentation and reassembly (SAR) function is used to segment variable-length packets into fixed-length cells, because switch fabric performance is improved when switching is done with fixed-length cells. Assuming the switch fabric supports a fixed 64-byte cell, the number of cells to process for an OC-192 stream is

(10 x 10^9 bits per second) / (64 x 8 bits per cell) = 19,531,250 cells per second
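These figures can be reproduced with a short calculation; a minimal sketch, assuming a 10 Gbps line rate, 64-byte cells, and a 250 ms TCP round trip time (one point in the 200-300 ms range quoted above):

    #include <stdio.h>

    int main(void)
    {
        const double line_rate = 10e9;        /* OC-192 example, bits/s */
        const double cell_bits = 64 * 8;      /* 64-byte cells          */
        const double rtt       = 0.250;       /* assumed TCP RTT, s     */

        printf("cells per second: %.0f\n", line_rate / cell_bits);  /* 19531250 */
        printf("buffer size     : %.2f Gbits\n", line_rate * rtt / 1e9); /* 2.50 */
        return 0;
    }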

Traffic management applications use different types of external memory. Table 1 compares them.

Table 1. External Memory Comparison

              DRAM           SRAM                                CAM
Latency       High           Low                                 Very Low
Density       High           Low                                 Low
Cost          Low            High                                Very High
Power         Low            Medium                              Very High
Applications  Packet Buffer  Pointers, Flow Tables, Rate Tables  Search, Classification

SDRAM is inexpensive and has high bandwidth, but it also has higher latency than SRAM, so it is used for functions with very high density needs, such as buffering the data as it is being processed. SDRAM also requires many pins. Figure 6 shows the number of pins required to interface to a 64-bit SDRAM. Many types of high-end networking equipment require several of these devices, leading to very high pin requirements for traffic management devices.


Figure 6. SDRAM Pin Example

Part Description: 64-bit x 128 Mb SDRAM

Pin Name    Function         Total Pins
A[0-12]     Address bits     13
BA[0-1]     Bank Address     2
DQ[0-63]    Data In/Out      64
DQS[0-7]    Data Strobe      8
CK[0-2]     Clock            3
!CK[0-2]    !Clock           3
CKE[0-1]    Clock Enable     2
CS[0-1]     Chip Select      2
RAS         Row Address      1
CAS         Column Address   1
WE          Write Enable     1
DM[0-7]     Data-In Mask     8
Total Pins                   108

Stratix II FPGAs are available in advanced pin packages that provide board area savings as well as high pin counts. For example, the Stratix II device is offered in the 1,508-pin FineLine BGA package with up to 1,150 user I/O pins. These high pin counts provide easy access to I/O pins for interfacing with external memory chips and other support devices in the system.

External Memory Bandwidth Analysis Example

When determining the appropriate memory requirements for buffering packets, consider both the width and the depth of the memory (see Figure 7).

Figure 7. Width & Depth of the Memory Subsystem

The required width depends on the memory's throughput. For example, if a 32-bit wide memory device has an access time of 20 ns (50 MHz), the raw throughput from this device is 32 x 50 x 10^6 bits per second, or 1.6 Gbps. However, it takes two additional cycles to access each 32-bit data word: one cycle for the read_enable or write_enable signal and one cycle for address latching. The data word itself is read on the third cycle, so the overhead for each 32-bit read or write is 40 ns and the total read or write cycle takes 60 ns. The effective throughput from this device is therefore 32 x (1/60) x 10^9 bits per second, or approximately 533 Mbps. Figure 8 shows this in the sample timing diagram.
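A minimal check of this cycle accounting, using the example's 32-bit width, 20 ns access time, and 40 ns of control overhead as the only inputs:

    #include <stdio.h>

    int main(void)
    {
        const double width_bits  = 32.0;
        const double cycle_ns    = 20.0;     /* 50 MHz access time     */
        const double overhead_ns = 40.0;     /* enable + address latch */

        double raw = width_bits / cycle_ns;                 /* Gbps */
        double eff = width_bits / (cycle_ns + overhead_ns); /* Gbps */
        printf("raw throughput      : %.2f Gbps\n", raw);        /* 1.60 */
        printf("effective throughput: %.0f Mbps\n", eff * 1e3);  /* 533  */
        return 0;
    }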



Figure 8. Memory Timing Diagram Sample

The timing diagram in Figure 8 shows that it takes a maximum time equal to tRC + tHCZE to read a data word from the port of this memory device. In burst access situations, where more than one word is read from or written to memory, the timing overhead for control signals such as CE (chip enable) and read_enable or write_enable is reduced. However, there is a limitation in burst reads and writes: consider the addresses from which data is accessed. Typically, in a four-word burst, data is accessed from addressN (base address), addressN+1 (base address + 1-word offset), addressN+2 (base address + 2-word offset), addressN+3 (base address + 3-word offset), and so on.

The memory depth requirements are driven by the processing time required to forward each packet after performing scheduling and the output time for traffic.

Estimate the depth of the memory and other characteristics with a combination of the following:

The arrival rate of each bit or word into the memory.

The departure rate of each bit or word out of the memory.

If the word size for a memory subsystem is 16 bytes (128 bits), one word arrives every 128 bits / 10 Gbps = 12.8 ns, or 0.1 ns per bit, for a 10 Gbps card. Similarly, if the processing time for the scheduler and the traffic manager is 1 microsecond, each arriving bit must be buffered for 1,000 ns, so there must be a buffer large enough to store 10,000 bits to avoid dropping data from the ingress portion of the flow.

For sizing purposes, the following applies:

Width of the memory subsystem = packet throughput / frequency of operation

Number of devices required at a given frequency = width of the memory subsystem / data word width per device at the same frequency
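A sizing helper that applies these relations is sketched below; the 10 Gbps line rate, 1 microsecond processing time, 200 MHz memory clock, and 32-bit device width are example assumptions, not recommendations.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double line_rate = 10e9;     /* bits/s                     */
        const double proc_time = 1e-6;     /* scheduler processing, s    */
        const double freq      = 200e6;    /* memory clock, Hz (assumed) */
        const double dev_width = 32.0;     /* data bits per device       */

        double buffer_bits = line_rate * proc_time;  /* ingress buffer depth */
        double width_bits  = line_rate / freq;       /* subsystem width      */
        double devices     = ceil(width_bits / dev_width);

        printf("ingress buffer : %.0f bits\n", buffer_bits);  /* 10000 */
        printf("subsystem width: %.0f bits\n", width_bits);   /* 50    */
        printf("devices needed : %.0f\n", devices);           /* 2     */
        return 0;
    }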


Memory requirements are simplified for this example by assuming a 4x bandwidth increase over the line rate. This increase accounts for the read and write cycles as well as other latencies associated with burst accesses. The following table shows the required memory throughput for different line rates and memory bus widths.

                          Memory Bus Width
Line Rate    32-bit       64-bit      128-bit     256-bit     512-bit
OC-12        78 Mbps      39 Mbps     19 Mbps     10 Mbps     5 Mbps
OC-48        313 Mbps     156 Mbps    78 Mbps     39 Mbps     20 Mbps
OC-192       1250 Mbps    625 Mbps    313 Mbps    156 Mbps    78 Mbps

Stratix II devices meet the memory throughput requirements of high-bandwidth applications by supporting the advanced memory technologies shown in Table 2.

Table 2. Stratix II External Memory Interface Support

Memory Technology   I/O Standard      Max. Clock   Max. Data Rate
SDR SDRAM           LVTTL             200 MHz      200 Mbps
DDR SDRAM           SSTL              200 MHz      400 Mbps
DDR II SDRAM        SSTL 1.8V I, II   266 MHz      533 Mbps
QDR II              HSTL I, II        250 MHz      500 Mbps
RLDRAM II           HSTL I, II        300 MHz      600 Mbps

The Memory Bottleneck: Queue Manager

A queue manager buffers the incoming data traffic from the packet processor and creates tables of pointers to the buffered data. These buffers typically are located off-chip in external memory, but with embedded memories, portions of the queue manager buffers can be kept on-chip. This section discusses an implementation of a queue manager utilizing the internal memory blocks of Stratix II devices.

Internal Memory in Stratix II Devices

The use of internal SRAM reduces pins, power, board space, cost, and latency. Stratix II devices provide embedded TriMatrix™ memory that is capable of handling traffic management memory requirements; Stratix II devices offer up to 9 Mbits of memory. The TriMatrix memory consists of three types of memory blocks: M512, M4K, and M-RAM. The M512 block supports 512 bits of memory, the M4K block supports 4 Kbits, and the M-RAM block supports up to 512 Kbits of memory per block.

Queue Manager

To implement the queue manager, map the internal M-RAM memory to the external memory. This address mapping can be done dynamically by creating a linked-list structure in hardware, or memory can be allocated statically by dividing the external memory into fixed-size submemory blocks. There are advantages and disadvantages to both approaches: the dynamic approach is more flexible and allows better utilization of memory, while the static approach does not incur the overhead of a linked-list structure and allows simpler handling of status signals. This section describes the static memory allocation approach only. Refer to Figure 9 for an example of a statically allocated memory implementation.


Figure 9. Static Memory Allocation

(Figure content: memory address pointers stored in M-RAM reference data stored in external RAM.)

Each queue/flow has a single entry, referred to as the queue entry, in the M-RAM. The following information describes the queue:

Status flags

Head pointer (read)

Tail pointer (write)

The status flags contain empty, full, almost empty, and almost full flags for each queue/flow. The head pointer stores the address location for the next read for the queue/flow in the external memory. The tail pointer stores the address location for the next write for the queue/flow in the external memory. Depending on the depth of the queue/flow required, the external memory is segmented into submemory blocks, which are controlled by each entry in the M-RAM, representing a single queue/FIFO.

For example, with an address width of 25 bits, designers can configure the M-RAM in 8K x 64-bit mode. The 64 bits contain two 25-bit addresses plus additional status flag bits. This configuration can manage up to eight thousand queues/flows, and merging multiple M-RAM blocks builds larger queue managers. The depth of the queues/flows is determined by the size of the external memory.
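One way to picture this packing is sketched below in C; the exact field layout is a hypothetical assumption for illustration, not the documented queue entry format.

    #include <stdint.h>

    /* Hypothetical layout of the 64-bit queue entry (for illustration):
       bits  0-24 : head (read) pointer, 25-bit external address
       bits 25-49 : tail (write) pointer
       bits 50-53 : status flags (empty, full, almost empty, almost full)
       bits 54-63 : unused */
    #define PTR_MASK ((1ULL << 25) - 1)

    static inline uint32_t entry_head(uint64_t e)  { return (uint32_t)(e & PTR_MASK); }
    static inline uint32_t entry_tail(uint64_t e)  { return (uint32_t)((e >> 25) & PTR_MASK); }
    static inline uint32_t entry_flags(uint64_t e) { return (uint32_t)((e >> 50) & 0xF); }

    static inline uint64_t entry_pack(uint32_t head, uint32_t tail, uint32_t flags)
    {
        return ((uint64_t)head & PTR_MASK)
             | (((uint64_t)tail & PTR_MASK) << 25)
             | ((uint64_t)(flags & 0xF) << 50);
    }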


The following information provides an example of a read or write process to the multi-queue/flow FIFO.

The requirements for the example are

64 queues/flows

Frame size of 64 bytes

Queue/flow depth of 128 frames

These requirements make each queue 8,192 bytes and the entire memory 524,288 bytes (4 Mbits). Memory is allocated to the first queue from bytes 0-8,191, to the second queue from bytes 8,192-16,383, and so on. When the pointers reach the upper limit of the allocated memory section, they loop back to the lower limit. The M-RAM is configured in the 8K x 64 mode to store a 64-bit wide queue entry.

Read & Write Operations

When a packet arrives at the queue, the scheduler determines which queue the packet is to be stored in, for example, queue 0-63. If a write to queue three is requested, the M-RAM accesses the queue entry at address location three. After the first clock cycle, the tail pointer is masked and sent to the external memory controller, along with the frame to be stored. The tail pointer is incremented by one frame size and operations are performed on the head and tail pointers to update the status flags. The updated pointers and status flag bits are written back into the M-RAM on the second cycle. The same occurs for a read request. Therefore, read and write requests take two cycles in the M-RAM, one cycle to obtain the external address to read or write from and one to update the pointers and status flags. The updated status flag bits for queue three are also sent to the queue manager for processing.

Status-signal generation and processing occur immediately after a read (see Figure 10) or write (see Figure 11) request because the status signals are embedded in the queue entry. (The alternative is to register individual status signals for each queue, which is not efficient; for example, empty and full flags for 8,000 queues would require 16K registers.) After signal generation and processing, the next step is to subtract the head pointer from the tail pointer and compute the absolute value. If the difference is zero, the queue is empty. If the difference is equal to the maximum depth of the queue, the queue is full. The queue manager must control the pointers so that the head (read) pointer never leads the tail (write) pointer, and it must manage the queues when they become full or empty. If the queue is empty, the queue manager should ignore all reads from the external memory for that queue. Other intermediate status signals, such as almost-full and almost-empty flags, may be generated as well.
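The flag arithmetic reduces to a small function, sketched here under the assumption of free-running pointers; the queue depth matches the example above, and the almost-full/almost-empty margin is an arbitrary placeholder.

    #include <stdint.h>

    #define QUEUE_DEPTH_BYTES 8192u   /* example: 128 frames x 64 bytes */
    #define ALMOST_MARGIN      512u   /* assumed almost-full/empty band */

    struct flags { uint8_t empty, full, almost_empty, almost_full; };

    /* Derive the status flags from the head/tail pointer separation.
       Assumes free-running pointers, so tail - head is the fill level
       in bytes even across wraparound. */
    struct flags compute_flags(uint32_t head, uint32_t tail)
    {
        uint32_t fill = tail - head;
        struct flags f;
        f.empty        = (fill == 0);
        f.full         = (fill >= QUEUE_DEPTH_BYTES);
        f.almost_empty = (fill <= ALMOST_MARGIN);
        f.almost_full  = (fill >= QUEUE_DEPTH_BYTES - ALMOST_MARGIN);
        return f;
    }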


Figure 10. Read Operation

(Figure content: the policer/scheduler sends a read request and queue address to the M-RAM; the queue entry feeds the status flag logic and the address counter, and the head pointer is sent to the memory controller, which reads the frame data from external memory.)

1. Read request to the scheduler or from within the scheduler.
2. The scheduler sends the appropriate queue address to read from the M-RAM.
3. The status flags are masked out and the head pointer is sent to the memory controller.
4. Calculate the appropriate status flags for the queue with the pointer information.
5. Check the status flags to determine if immediate action is required (for example, the queue is empty). If the queue is empty, a read from external memory is not required.
6. Send the head pointer to the address counter to increment to the next memory location.
7. Rebuild the queue entry and write the data into the M-RAM.


Figure 11. Write Operation

(Figure content: incoming data is split into a header, sent to the policer/scheduler, and a frame, sent to the memory controller; the scheduler's queue address selects the M-RAM queue entry, whose tail pointer passes through the address counter and status flag logic to the memory controller, which writes the frame to external memory.)

1. Incoming data arrives from traffic. Mask out the header to the scheduler and the frame to the memory controller.
2. The scheduler parses the header information and determines in which queue to place the frame.
3. Send a read request to the M-RAM and the queue address to access.
4. Mask out the tail pointer, which contains the address in the external memory, and send the tail pointer to the memory controller.
5. Send the tail pointer to the address counter to increment to the next memory location.
6. Calculate the appropriate status flags for the queue with the pointer information.
7. Check the status flags to determine if immediate action is required (for example, the queue is full).
8. Rebuild the queue entry and write the data into the M-RAM.


Static Memory Allocation

For statically allocated memory, initialize the M-RAM with the submemory block starting addresses for each queue before startup. The M-RAM requires an initialization circuit to write the starting addresses for each queue/flow. This can be done with a state machine and a counter that increments by the depth of each queue/flow. Once the M-RAM has been initialized, the state machine sends a flag confirming that it is ready to operate as a queue manager. Alternatively, use an external LUT to initialize the M-RAM. The external LUT holds the starting address for each queue, which is read into the M-RAM to initialize the queue manager.

To determine the memory space for each queue (see Table 3), allocate a power-of-two (2^n) memory space for each queue, where the memory space is divisible by the frame size. This simplifies the pointer address operations: the counter increments by the frame size, and when it reaches the upper memory space limit, it automatically rolls over to zero (the lower limit). For example, if the frame size is 64 bytes (2^6) and the depth is 128 frames, each memory space is 8,192 bytes (2^13). The address can then be broken into two parts: a static MSB portion that denotes which queue it belongs to and a dynamic LSB portion that changes as the specific queue is filled.

Table 3. Pointer Address Example

Static Queue Identifier   Dynamic Frame Counter/Pointer for Queue
000000000000              0000000000000
000000000001              0000000000000
…                         …
111111111110              0000000000000
111111111111              0000000000000

The upper MSB bits remain the same for a specific queue; only the lower LSB bits are changed by the address counter. This keeps the address counter operation small, uniform for all queues, and more efficient.
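Because each queue's space is a power of two, the rollover is a simple mask operation. A minimal sketch, assuming the 64-byte frame and 8,192-byte queue space of the example:

    #include <stdint.h>

    #define FRAME_BYTES  64u                  /* 2^6  */
    #define QUEUE_BYTES  8192u                /* 2^13 */
    #define OFFSET_MASK  (QUEUE_BYTES - 1u)   /* dynamic LSB portion */

    /* Advance a pointer by one frame; the static queue-identifier MSBs
       are untouched, and the offset rolls over to zero automatically. */
    static inline uint32_t advance_pointer(uint32_t addr)
    {
        uint32_t base   = addr & ~OFFSET_MASK;               /* queue MSBs */
        uint32_t offset = (addr + FRAME_BYTES) & OFFSET_MASK;
        return base | offset;
    }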

The alternative is to have a special function handle the pointer for each queue once it reaches its upper limit, for example, a look-up function implemented in logic (outside of the functions described) that resets the pointer to the lower limit.

Traffic Shaping

Traffic shaping is a mechanism that alters the traffic characteristics of a stream of packets/cells in order to make them conform to a traffic descriptor. A traffic descriptor is a set of parameters that describes the behavior of a data source. There are three parameters that describe the data source traffic:

The average rate

The peak rate

The burst size

Shaping the data source traffic to the above parameters means that the data source can send packets at the long-term average rate or it can send bursts at the peak rate. Traffic shaping is performed at the entrance nodes of the network, and the devices that shape the incoming traffic are called regulators.


Leaky Bucket Algorithm

The leaky bucket algorithm accumulates fixed-size tokens into a bucket at a defined rate. An incoming packet is transmitted only if the bucket has enough tokens. Otherwise, the packet waits in a buffer until the bucket has enough tokens for the length of the packet. Figure 12 illustrates the leaky bucket operation. As the figure shows, tokens are added to the bucket at the average rate. On a packet departure, the leaky bucket removes the appropriate number of tokens. If the incoming packets are segmented into fixed-size units and one token is removed from the bucket for a packet departure, then the size of the bucket corresponds to burst size.

By replenishing tokens in the bucket at the average rate and permitting the departure of contiguous packets, one can control two of the three traffic parameters: the average rate and the burst size. To control the peak rate, add a second leaky bucket. If the token replenishment interval corresponds to the peak rate and the token bucket size is set to one token, the second leaky bucket is a peak-rate regulator. The second leaky bucket is located before the first leaky bucket and ensures that the traffic entering the first bucket conforms to the peak rate. The second leaky bucket does not have a buffer; instead of dropping nonconforming packets, it marks them and transmits them to the next leaky bucket. The marked packets are dropped in case of buffer overflow. If a leaky bucket does not have a buffer to hold nonconforming packets, it is called a policer; a policer drops the nonconforming or marked packets. A leaky bucket can be implemented as a calendar queue (a standard implementation) or a slotted wheel. The next section describes an example of a calendar queue.
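A minimal software model of a single token-based leaky bucket is sketched below; the rate and burst parameters are placeholders, and tokens are counted per fixed-size cell as in the text.

    struct leaky_bucket {
        double tokens;      /* current token count                  */
        double rate;        /* token replenishment, tokens/second   */
        double burst;       /* bucket size = permitted burst, cells */
        double last_time;   /* time of last update, seconds         */
    };

    /* Returns 1 if a cell may depart at time 'now', 0 if it must wait
       in the buffer until enough tokens accumulate. */
    int bucket_conforms(struct leaky_bucket *b, double now)
    {
        b->tokens += (now - b->last_time) * b->rate;   /* replenish    */
        if (b->tokens > b->burst)
            b->tokens = b->burst;                      /* cap at burst */
        b->last_time = now;
        if (b->tokens >= 1.0) {
            b->tokens -= 1.0;                          /* cell departs */
            return 1;
        }
        return 0;
    }

Cascading a second bucket with a burst size of one token and a replenishment rate equal to the peak rate yields the peak-rate regulator described above.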

Figure 12. Simple Leaky Bucket Model

Calendar Queue Implementation of Leaky Bucket

A calendar queue consists of a clock and an array of pointers, as shown in Figure 13. Each pointer corresponds to the list of packets that are serviced during this slot. The “initial” duration of a slot equals the calendar queue’s clock period. However, due to the variability in the number of the packets in each list, the time slot duration is variable. When all the packets of a slot’s list are serviced, the next slot becomes active. The pointer of the next slot indexes to its corresponding list of packets. A packet is inserted into the proper slot after the scheduler assigns a slot tag to it. A packet that must be serviced during a slot in the current round may be linked in the same list with a packet that must be serviced at the next round. The calendar queue size is estimated as follows:

# of slots x calendar queue clock period > period of slowest connection
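The slot arithmetic can be sketched as follows; the slot count and clock period are example parameters chosen to satisfy the bound above, and the packet list is a simple singly linked list.

    #include <stddef.h>

    #define NUM_SLOTS 1024              /* example; must satisfy the bound above */

    struct packet {
        struct packet *next;
        double service_time;            /* absolute time tag from the scheduler */
    };

    static struct packet *slot[NUM_SLOTS];   /* one list head per slot   */
    static double clock_period = 100e-9;     /* example slot duration, s */

    /* Insert a packet into the slot that covers its service time; packets
       due a full round apart share a list, as noted in the text. */
    void calendar_insert(struct packet *p)
    {
        size_t s = (size_t)(p->service_time / clock_period) % NUM_SLOTS;
        p->next  = slot[s];                  /* push onto the slot's list */
        slot[s]  = p;
    }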



This algorithm can be implemented using the Stratix II M-RAM blocks, following a structure similar to that described in “The Memory Bottleneck: Queue Manager.” The memory structures of Stratix II devices enable the shaper to maintain the list of pointers inside the device, eliminating the off-chip delays and additional board space associated with external memories.

Figure 13. Simple Calendar Queue Model

Statistics

Also called metering, statistics provide information on which packets do not meet the appropriate SLAs. Metering can also enable dynamic billing based on usage.

High-speed counters perform traffic management metering, and the results of these counters are stored in memory. A hierarchical memory structure supports the large number of counters necessary for keeping statistics. Figure 14 shows an example of this type of memory.

You can implement the high-speed counters in Stratix II logic elements. Such counters are capable of running at speeds of more than 300 MHz.



Another way to use internal memory is to create a hierarchical memory structure to support statistics counters. The need for hierarchical memory again arises from the external DRAM memory bottleneck: due to its inherent latency, the throughput of current DRAM technologies cannot meet the requirement of updating numerous counters per cell at line rate. This latency requires temporary counters to be stored in SRAM, which are then used to update the external DRAM periodically. In this way, the DRAM latency is only incurred periodically, at an interval determined by the size of the SRAM counters. The statistics engine updates the appropriate “small” counter values in SRAM as packets are received. Periodically, the statistics engine reads the “large” counter values from external DRAM, adds the “small” counter values, and resets the “small” counter values to zero in the internal SRAM. The M4K blocks within Stratix II devices can be configured to temporarily store the count values for each counter. For example, for 64K flows, the M4K blocks can store up to three 8-bit counters for each flow. This reduces the number of times the external DRAM needs to be accessed by a factor of up to 2^8 (the rollover period of an 8-bit counter).
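The SRAM/DRAM split can be modeled with two counter arrays and a periodic flush, as sketched below; the flow count and counter width follow the example, while the overflow-triggered flush is an assumption about when the periodic DRAM update occurs.

    #include <stdint.h>

    #define NUM_FLOWS 65536
    #define SMALL_MAX 255              /* 8-bit on-chip counter limit */

    static uint8_t  small[NUM_FLOWS];  /* models the M4K-resident counters  */
    static uint64_t large[NUM_FLOWS];  /* models the external DRAM counters */

    /* Per-packet update touches only the small counter; the large DRAM
       counter is folded in just before the small counter would overflow,
       so DRAM is accessed at most once per 2^8 packets per flow. */
    void count_packet(uint32_t flow)
    {
        if (small[flow] == SMALL_MAX) {
            large[flow] += small[flow];    /* periodic DRAM update */
            small[flow]  = 0;
        }
        small[flow]++;
    }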

Figure 14. Hierarchical Memory for Statistics Engines


Conclusion

In today’s environment, ASIC or standard product solutions incur significant risk: volumes are uncertain, which leads to exorbitant nonrecurring engineering (NRE) costs, and the market can support only a limited number of ASSP providers. FPGAs are a natural fit for implementing traffic managers because they limit this risk and enable differentiated traffic management solutions. Additionally, a reconfigurable solution can be used to add and support new services in the future.

The advanced architecture of Stratix II devices, coupled with the advantages of the migration to 90 nm process technology, enables the devices to service high-end traffic manager requirements. The enhanced fabric is optimized for the computationally intensive functions of traffic management. The support of flexible high-speed memory allows memory management at today’s highest rates, with support for future memory standards. The embedded memory structure of Stratix II devices enables storage of pointer tables in the large M-RAM blocks and statistics caches in M4K blocks. Stratix II devices offer a complete solution for implementing high-speed traffic management.



101 Innovation Drive San Jose, CA 95134 (408) 544-7000 www.altera.com

Copyright © 2004 Altera Corporation. All rights reserved. Altera, The Programmable Solutions Company, the stylized Altera logo, specific device designations, and all other words and logos that are identified as trademarks and/or service marks are, unless noted otherwise, the trademarks and service marks of Altera Corporation in the U.S. and other countries.* All other product or service names are the property of their respective holders. Altera products are protected under numerous U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera’s standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera Corporation. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.

