
Workload Characterization and Performance for a Network Processor

Mitsuhiro Miyazaki

B.S., Osaka University (Japan), 1994

A Thesis Presented to the Faculty of

Princeton University

in Candidacy for the Degree of

Master of Science in Engineering

Professor Ruby B. Lee

Research Advisor

The Department of Electrical Engineering

Princeton University

June, 2002


Abstract

The explosive growth of the Internet and e-business requires faster deployment of high-bandwidth equipment, greater flexibility to support emerging Internet technologies, and new services within the network. The design of routers is being changed significantly by the emergence of Network Processors (NPs). With programmable NPs, exceptionally fast packet processing at high bandwidth is achieved through the optimization of both the instruction set and the data path. Routers have to perform complicated protocol stack processing under demands for various services. However, the fast packet switching path, namely the packet forwarding path with table lookup, filtering, queuing assignment, and input/output scheduling, influences network performance more than the slow packet switching path. This paper characterizes router processing with pseudo code based on the fast path for an emerging network processor, the Intel IXP1200. It also addresses the workload characterization of the fast path in routers. I expect these characterizations to be very useful in guiding the architectural design of future network processors. They should also be beneficial for comparing the performance of different solutions to fast path processing, using combinations of different network processors, general-purpose processors, and hardware ASICs.

Network Processors (NPs) are generally designed for edge or backbone routers. Therefore, NPs need to be adapted to a wide range of networks. For high-end networks, NPs may be assigned to OC-192 (10 Gbps). However, such rates are still outside the reach of existing NP products. In reality, the primary target of current NPs is up to OC-48 (2.49 Gbps) wire speed, which delivers minimum-sized packets at 5.65 million packets per second (pps). In this paper, I evaluate the IXP1200 from the computer architect's point of view rather than the network infrastructure point of view. First, this paper presents the instruction mix (i.e. the distribution of executed instructions) of the Microengine on the fast packet switching path, on the basis of simulation results of a reference program provided by Intel, and clarifies the types of instructions most important for NPs from the perspective of instruction set architecture design. Next, it shows the latencies in accessing external and internal resources such as SDRAM, SRAM, and the receive FIFO buffer. Since the IXP1200 can hide such latencies with context swap instructions, and since memory accesses generally take many cycles and reduce CPU efficiency, the benefit of this mechanism is easy to see. In addition, it evaluates CPI (Cycles per Instruction) and the ratio of executing, aborted, stalled, and idle cycles, which indicates the efficiency of multithreading and fast context swapping in the IXP1200. Finally, this paper presents the throughput the IXP1200 can achieve at OC-48, and compares the IXP1200 with other well-known NPs with regard to their context switch and branch mechanisms.


Acknowledgments

I am indebted to my advisor, Ruby Lee, for her valuable feedback and help throughout my entire thesis. I will always admire Ruby's vision and inspiration, as well as her kind-hearted personality.

Paul Huang, Abdulla F. Bushait, and Robert Miller have always been helpful to me and have certainly made my life easier and more fun at Princeton. Thanks to Aaron Moore, a close friend at Princeton to talk to about many different subjects. Thanks to Shiro and Kuriko Okita, and Hidechika and Emi Koizumi, for engaging me in many interesting discussions and experiences. I am also very grateful to Lidija Lukic for providing me with English advice and wonderful knowledge. Thanks to Richard G. Knight for giving me precious and enjoyable information outside my field.

Special thanks to my parents, Yukiko and Shigeto Miyazaki, for their love, and the desire they have instilled in me to learn and excel. Thanks to my sisters Chie and Shiho for cheering me up. I am also very grateful to my wife's parents, Yoshiko and Yukio Kaneko, for their support and encouragement to finish this thesis.

Most of all, I thank my wonderful wife Atsuko, who has helped me and been patient throughout this entire process and deserves much credit for this thesis. She has been very supportive, and I truly appreciate all that she has done for me while working full time and providing a warm and loving environment for our family.


Contents

1 Introduction
2 Network Configuration and Market for Network Processors
3 Router Processing and Workload Characterization
  3.1 Router Processing
  3.2 Workload Characterization and Proposal
  3.3 Pseudo Code of Router Processing
    3.3.1 Receive Packet Processing
    3.3.2 Transmit Packet Processing
4 Network Processor Architecture
  4.1 Microengine Architecture
  4.2 FBI Unit Architecture and IX Bus Interface
  4.3 Microengine Pipelining
  4.4 Memory Access
  4.5 Branch and Context Switch Mechanism
    4.5.1 Class 3 Instructions
    4.5.2 Class 2 Instructions
    4.5.3 Class 1 Instructions
    4.5.4 Solutions for Branch Penalties
5 IXP1200 Network Processor Evaluation
  5.1 Methodology
  5.2 Instruction Mix
  5.3 Latency
  5.4 Execution, Aborted, Stalled, and Idle Ratio
  5.5 CPI (Cycles per Instruction)
  5.6 Throughput
6 Other Network Processors
  6.1 Lexra's NetVortex
  6.2 Motorola's C-5
  6.3 IBM's PowerNP
7 Conclusions and Future Work
8 Bibliography
Appendix A Pseudo Code
Appendix B Microengine Instruction Set
Appendix C Instruction Mix Data
Appendix D Latency
Appendix E Multithreading Example
Appendix F Theoretical Throughput Calculation for IP Packets
Appendix G Instruction Set of Other NPs
List of Figures
List of Tables


1. Introduction

Network bandwidth has become a critical resource in the Internet in recent years. Routers perform key functions to accommodate increasing traffic from users. Until the late 1990s, an edge device employed high-performance general-purpose CPUs to perform tasks such as header processing, forwarding, table lookups, access control, and implementing the network stack. As another approach, ASICs were considered because they can perform tasks at wire speed. However, general-purpose CPUs and ASICs have problems with performance and flexibility, respectively. Furthermore, agile delivery of products is also required to provide sophisticated communication infrastructures within very limited time-to-market frames. Network processors (NPs) are now expected to fill the need that CPUs and ASICs fail to meet. NPs are programmable engines that are optimized to perform wire-speed communication. They have made it possible to significantly improve the performance and flexibility of routers, and even the agility of delivery.

Routers are generally placed at edges and in backbones. Therefore, NPs must be adapted to a variety of speeds in the Internet. For high-end networks, NPs could be assigned to OC-192 (10 Gbps), which delivers at most 22.6 million minimum-sized packets per second (pps). (Note: a minimum-sized packet is defined as a 64-byte packet in this paper.) Generally speaking, however, multiple NPs would be required to achieve such high speeds. In reality, the main target of current NPs such as the Intel IXP1200 [1], the Vitesse IQ2000 [2], the Motorola C-5 [3], and the IBM PowerNP [4] is OC-48 (2.49 Gbps) wire speed, which delivers minimum-sized packets at 5.65 Mpps; otherwise it is OC-24 (1.24 Gbps), OC-12 (622 Mbps), or OC-3 (155 Mbps).

Routing is an inherently parallel task in routers because each packet that traverses a network is a self-contained package with its own destination header and data payload. Routers have to process each packet independently and out of order, supervise myriad packets in parallel, and control this huge volume of traffic. NPs are commonly composed of multiple processors and run multiple threads simultaneously, which maps well onto IP routing, data forwarding, and other header processing. For example, Intel's IXP1200 includes six independent RISC processors called Microengines, each supporting four contexts with hardware multithreading. As a result, the IXP1200 can manage 24 completely independent threads, execute the data-intensive tasks of steering packets toward their destinations in parallel, and hide the long latencies of off-chip memory references by rapidly switching contexts among threads.

Routers fundamentally have to perform complicated protocol stack processing under demands for various services and speeds. This paper first describes the Internet hierarchy and clarifies the target markets of NPs in Section 2. Then it characterizes router processing and workloads, not only for router system analysis but also for NP analysis, in Section 3. Since the performance requirements of NPs should be defined on the basis of a realistic workload, four workload models are proposed for NP simulation based on real Internet packet data. The four workload models consist of three fixed-size packet workloads (64 bytes, 594 bytes, and 1518 bytes) and one mixed-packet workload including those three packet sizes. Section 4 then introduces the architecture of the Microengine and the Fast Bus Interface (FBI) unit, which is responsible for transferring packets to and from external Media Access Control (MAC) layer devices. In addition, it presents Microengine pipelining, memory access, and the branch and context switch mechanisms.

This paper presents experimental results and an evaluation of the Microengine. Section 5 describes the evaluation methodology and then presents five simulation results for the four proposed workloads with regard to the Microengines: 1) instruction mix (i.e. the distribution of executed instructions), 2) memory access latency, 3) the ratio of execution, aborted, stalled, and idle cycles, 4) CPI (Cycles per Instruction), and 5) throughput on the fast switching path. For the instruction mix, the Microengine instruction set is categorized into five groups and analyzed for workload dependence. In addition, the advantage of context switching is demonstrated by the memory access latencies and the ratio of stalled cycles. CPI shows the architectural constraints of a Microengine and their dependence on workload. The throughput shows what rate a single IXP1200 can achieve for wire-speed communication. Finally, Section 6 introduces other well-known NPs and compares them with the IXP1200 in terms of their context switch and branch mechanisms, and Section 7 concludes with the research results.


2. Network Configuration and Market for Network Processors

In response to the rapid and extensive growth of Internet traffic, Internet service providers (ISPs) are experiencing constant demands for expanded services and network features. Backbone operators face the same demands for high-speed switching and routing. The Internet's explosive growth is driving requirements for higher quality, faster connectivity, and more software features for an ever-growing number of customers. Routers are deployed at various places in the Internet. In this section, I present a conceptual view of the Internet and clarify the target markets of Network Processors (NPs) within it.

Figure 2-1 depicts the Internet hierarchy divided into five levels. The first level is the Network Access Points (NAPs), where major Internet backbone operators, called Network Service Providers (NSPs), interconnect to establish the core of the Internet. NSPs also interconnect at Metropolitan Area Exchanges (MAEs). Since MAEs serve the same purpose as the NAPs and are privately owned, they are not shown in Figure 2-1. The second level is the national backbone operators, sometimes referred to as National Service Providers (NSPs), and the network of networks spreads out from there. Some of the large NSPs are UUNet, IBM, BBN Planet, SprintNet, PSINet, etc. The third level of the Internet is made up of regional networks and the companies that operate regional backbones. Typically, they operate backbones within a state or among several adjoining states, much like the NSPs. They typically connect to an NSP, or increasingly to several NSPs, to be on the Internet. Some have a connection to a single NAP, and from there they extend the network to smaller cities and towns in their areas. In general, levels 1, 2, and 3 together can be called the Core. The Core can be thought of as a huge network mixing Synchronous Optical Network (SONET)/Synchronous Digital Hierarchy (SDH), Frame Relay, and Asynchronous Transfer Mode (ATM), and consisting of Core routers and switches that support a variety of high-speed links: OC-192/STM64 (10 Gbps), OC-48/STM16 (2.49 Gbps), OC-24/STM8 (1.24 Gbps), OC-12/STM4 (622 Mbps), OC-3/STM1 (155 Mbps), T3/DS3 (45 Mbps), T1/E1 (1.5 Mbps/2 Mbps), and so on. Some backbone maps can be found at [5], [6], and [7].

The fourth level of the Internet is the individual Internet Service Providers (ISPs). They lease connections from an NSP or a regional network operator. An ISP network usually consists of a number of POPs (Points of Presence). A POP is a physical location where a set of Edge and Core routers is located. Therefore, even though level 4 is basically called the Edge, part of level 4 can be recognized as part of the Core. Edge routers generally provide individual subscribers with access to the Core network, and are also required to support various speed ranges: OC-24/STM8, OC-12/STM4, OC-3/STM1, T3/DS3, T1/E1, and so on. The fifth level of the Internet is the consumer and business market, and basically includes Access routers, which connect a customer to an ISP POP, and Customer routers, which connect to the end points of the Internet. The required speed of those routers is much lower than that of Edge and Core routers. Since large enterprises sometimes operate much like an ISP, they may also have a Core router and an Edge router to connect their branch offices.

In fact, the main target application of NPs is Edge routers and Core routers/switches. Those routers require more powerful processing capability and flexibility than Access and Customer routers. Therefore, NPs obviously need to achieve wire speed at the optical network level and to offer programmability, flexibility, and scalability.

Note: Level 1-3: Core, Level 4: Edge, and Level 5: Consumer and Business

Figure 2-1. Internet Hierarchy
[Figure: five-level hierarchy; NAPs interconnecting NSPs at Level 1, NSPs at Level 2, Regional Networks at Level 3, ISP POPs with Core routers/switches, servers, and Edge routers at Level 4, and consumer, business, small-office, and enterprise sites behind Access and Customer routers at Level 5]


3. Router Processing and Workload Characterization

3.1 Router Processing

Routers are the most common network layer devices in the Open Systems Interconnection (OSI) seven-layer model. Routers are connected to at least two networks and decide which way to send each packet based on their current understanding of the state of the networks they are connected to. Routers create or maintain a table of the available routes and their conditions, and use this information along with distance and cost algorithms to determine the best route for a given packet. Typically, a packet may travel through a number of network points with routers before arriving at its destination. Routers actually support a variety of functions in addition to IP routing. In reality, a router's functions depend on the specifications set by its vendor. However, the fundamental functions covered by most routers can be generalized. To assess NP processing performance, we should focus especially on fast path processing and characterize the router processing that is, in most cases, executed by software. A good reference on the components of routers is [8]. This section characterizes fundamental router processing based on it.

Figure 3-1. Router Processing on Fast Path
[Figure: block diagram of the fast path: packets flow from the input ports through the Input Scheduler (IS), Classifier (CF), Filter (FL), and Receive Packet Buffers (RPB) to the Forwarder (FW), then through Queuing Assignment (QA) and the Transmit Queuing Buffers (TQB) to the Output Scheduler (OS) and output ports]

Figure 3-1 depicts router processing on the fast path (i.e. the forwarding path). First, the Input Scheduler (IS) manages input port sharing and gets a packet from an input port. The received packet is first placed into a receive FIFO (RFIFO). In reality, the packet passes through a physical layer device (PHY) and a Media Access Control (MAC) device, with framing and error detection at the data link layer, before reaching the IS. After that, the packet is parsed, and the Classifier (CF) chooses an appropriate Receive Packet Buffer (RPB) assigned to a Forwarder (FW) based on certain fields in the packet header. In general, different FWs can be applied to incoming packets according to protocol, service type, priority, flow control, and so on. Most data link protocols have some sort of protocol identifier field that can be used to select the FW on a specific interface. For example, the type field in Ethernet and the Logical Link Control (LLC)/Subnetwork Access Protocol (SNAP) defined in IEEE 802.2 are frequently used for identification, and not only for the LAN protocols. In addition, the IP option and Type of Service (TOS) fields of IPv4, and the priority and flow label fields of IPv6, can be applied for classification. The classification decision is generally made on fields associated with OSI Layer 2 through Layer 4.
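As an illustration (my sketch, not the thesis's code), a classifier that dispatches to per-protocol forwarders might look like the following C fragment; the forwarder hooks are hypothetical:

    /* Hypothetical classifier: select a forwarder by the Ethernet type field. */
    #include <stdint.h>

    #define ETHERTYPE_IP  0x0800
    #define ETHERTYPE_ARP 0x0806

    typedef struct { uint16_t ethertype; const uint8_t *hdr; } packet_t;

    void forward_ipv4(packet_t *pkt);   /* fast-path IP forwarder (assumed) */
    void to_slow_path(packet_t *pkt);   /* hand off to the control processor */
    void drop(packet_t *pkt);           /* discard unsupported traffic */

    void classify(packet_t *pkt) {
        switch (pkt->ethertype) {
        case ETHERTYPE_IP:  forward_ipv4(pkt); break;
        case ETHERTYPE_ARP: to_slow_path(pkt); break;
        default:            drop(pkt);         break;
        }
    }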

Routers typically provide access control mechanisms for permitting or denying the flow of packets. Since the router parses a packet in the process of classification, it can gather the information needed by the Filter (FL) at the same time, perform the FL operation, and discard irrelevant packets before they are forwarded. There are various filtering operations at OSI Layers 2 and 3. A Layer 2 FL can permit or deny forwarding based on a MAC source and/or destination address, protocol type, Ethernet vendor code, or LLC information. The typical FL parameters at Layer 3 include Layer 3 source and/or destination addresses, either explicitly or after a wildcard mask is applied. Other parameters include the IP protocol type, TOS/IP precedence bits, and TCP and UDP port values. The latter parameters are actually Layer 4 information, but they are commonly specified in a Layer 3 context.

In fact, this sort of FL can be placed before the FW, after it, or both. An FL positioned before the FW is defined as an Inbound Filter, and one positioned after it as an Outbound Filter. Inbound FL actions are applied to all incoming packets, whereas Outbound FL actions are applied only to specified packets. Even though Outbound FL could be considered more efficient than Inbound FL, the choice depends on the router maker. In Figure 3-1, the FL is an Inbound Filter.
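A minimal sketch of a Layer 3 rule check with a wildcard mask (first matching rule decides, implicit deny at the end); the rule layout and names are illustrative, not from the thesis:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t src, src_wild;   /* wildcard bits set to 1 are "don't care" */
        uint32_t dst, dst_wild;
        uint8_t  proto;           /* IP protocol type, 0 = any */
        bool     permit;
    } fl_rule_t;

    bool fl_permit(const fl_rule_t *rules, int n,
                   uint32_t src, uint32_t dst, uint8_t proto) {
        for (int i = 0; i < n; i++) {
            const fl_rule_t *r = &rules[i];
            if ((src & ~r->src_wild) == (r->src & ~r->src_wild) &&
                (dst & ~r->dst_wild) == (r->dst & ~r->dst_wild) &&
                (r->proto == 0 || r->proto == proto))
                return r->permit;     /* first matching rule decides */
        }
        return false;                 /* implicit deny */
    }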

Page 18: Workload Characterization and Performance for a Network ...palms.ee.princeton.edu/PALMSopen/miyazaki02workload.pdf · time-to-market frames. Network processors (NPs) are now very

10

The Forwarder (FW) picks packets out of the Receive Packet Buffer (RPB). In general, the FW manipulates the TTL and checksum fields of the IP header, performs an IP lookup in the forwarding table, modifies the data link level header and IP header, and delivers the packet toward the output ports. Routers commonly have two key data structures for the lookup table: the Routing Information Base (RIB) and the Forwarding Information Base (FIB). The RIB is optimized for updating by dynamic routing mechanisms such as the Routing Information Protocol (RIP), Interior Gateway Routing Protocol (IGRP), Enhanced Interior Gateway Routing Protocol (EIGRP), Open Shortest Path First (OSPF), and Border Gateway Protocol (BGP). The FIB, on the other hand, is optimized for high-speed lookup and packet forwarding. The RIB is not illustrated in Figure 3-1 because it is not part of the fast path. The FIB is expected to have efficient data structures and algorithms, and sometimes hardware-assisted lookup, for rapid forwarding. The fundamental data structure of the FIB is some sort of hash table or tree lookup table containing forwarding information such as the prefix and next hop.
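Of the header manipulations above, the TTL decrement shows how light the per-packet work can be: the checksum need not be recomputed from scratch but can be updated incrementally. The following C fragment sketches the well-known technique from RFC 1141 (my illustration, not the thesis's code); it assumes the checksum word is held in network byte order so that the TTL occupies the high byte of its 16-bit word:

    #include <stdint.h>

    /* Decrement the IPv4 TTL and incrementally update the header checksum. */
    void ip_decrement_ttl(uint8_t *ttl, uint16_t *checksum) {
        (*ttl)--;
        uint32_t sum = *checksum + 0x0100;           /* increment checksum high byte */
        *checksum = (uint16_t)(sum + (sum >> 16));   /* fold the carry back in */
    }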

After forwarding, Queuing Assignment (QA) puts outbound packets into the Transmit Queuing Buffer (TQB) corresponding to an output port. Routers basically must implement some queuing discipline that governs how packets are buffered while waiting to be transmitted. A queuing algorithm is composed of a scheduling discipline and a drop policy. The simplest queuing algorithm is First-In First-Out (FIFO) queuing with a tail drop policy. Tail drop means that packets arriving at the end of the FIFO are dropped if the FIFO is full. A simple variation of FIFO queuing is priority queuing. In this case, the router implements two kinds of FIFO queues: a priority queue and a non-priority queue. For example, the priority could depend on the Type of Service (TOS) field in the IP header. Additionally, the Fair Queuing (FQ) algorithm maintains a separate queue for each flow currently being handled by the router, and the router serves these queues in a round-robin manner.

In reality, the router could use more sophisticated strategies such as Weighted Fair Queuing (WFQ). WFQ assigns a weight to each flow, namely each queue. This weight logically specifies how many bits to transmit each time the router services that queue, which effectively controls the percentage of the link's bandwidth that the flow will get. It can also be applied to classes of traffic, such as TOS classes in the IP header, as sketched below.
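The following C fragment sketches the weight idea in its simplest packet-based weighted round-robin form (my illustration; real WFQ operates on bits or bytes and virtual finish times, and all names here are hypothetical):

    #include <stddef.h>

    #define QDEPTH 256

    typedef struct {
        int   weight;        /* packets this queue may send per round */
        int   head, tail;    /* ring indices; queue is empty when equal */
        void *pkts[QDEPTH];
    } queue_t;

    /* Return the next packet to send, or NULL if every queue is empty.
       *cur is the queue being serviced, *credit its remaining allowance. */
    void *wrr_next(queue_t *q, int nq, int *cur, int *credit) {
        for (int scanned = 0; scanned <= nq; scanned++) {
            queue_t *cq = &q[*cur];
            if (cq->head != cq->tail && *credit > 0) {
                void *p = cq->pkts[cq->head];
                cq->head = (cq->head + 1) % QDEPTH;
                (*credit)--;
                return p;
            }
            *cur = (*cur + 1) % nq;      /* advance to the next queue */
            *credit = q[*cur].weight;    /* and refresh its credit */
        }
        return NULL;
    }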

The Output Scheduler (OS) selects one of the non-empty TQBs, transfers a packet into the Transmit FIFO (TFIFO), and then sends it to the associated output port. With FQ, the output scheduler checks the TQBs in a round-robin manner and delivers the packet. The scheduler generally performs no processing on the packet itself. However, if multiple paths to the same destination exist, Load Balancing (LB) can be employed in the output scheduler. LB optimizes the use of bandwidth and the recovery time after link or interface failures. LB can be based on round-robin, per-packet, or per-destination selection, a source-destination hash, and so on.
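For instance, a source-destination hash keeps all packets of a flow on one path, which preserves packet ordering. A minimal sketch (mine, with a deliberately trivial mixing function):

    #include <stdint.h>

    /* Pick one of num_paths equal-cost paths for this source/destination pair. */
    int lb_select_path(uint32_t src_ip, uint32_t dst_ip, int num_paths) {
        uint32_t h = src_ip ^ dst_ip;
        h ^= h >> 16;                       /* fold high bits into low bits */
        return (int)(h % (uint32_t)num_paths);
    }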

3.2 Workload Characterization and Proposal

Generally speaking, real Internet traffic includes various sizes and types of packets, which generate network load and affect a router's processing ability. Therefore, when testing routers, it is very important to consider what kinds of packets occur most frequently in the real Internet stream.

In this section, I characterize workloads for NP evaluation and propose four workloads for testing the Microengine, based on real Internet measurement data collected by the Measurement and Network Analysis Group of the National Laboratory for Applied Network Research (NLANR) project, located at the San Diego Supercomputer Center [10]. The reference data were gathered during February 2001 under National Science Foundation Cooperative Agreement No. ANI-9807479 and NLANR. These data are also used by router tester makers such as Agilent Technologies [11]. The NLANR Measurement and Network Analysis Group monitors real Internet packets and records them every day; the raw data can be found on the NLANR web site at [12].

Briefly summarized, a total of 342 million packets were sampled and recorded at the network monitor site during this period. The average packet size was 402.7 bytes, with the packet sizes and types in Table 3-1 occurring most frequently.

Table 3-1. Most frequently occurring packets in the real Internet

  Packet Size    Packet Type Description                           Packet         Internet
                                                                   Distribution   Traffic

  1) 40 Bytes    TCP packets with an IP header but no payload      35%            3.5%
                 (i.e. only a 20-byte IP header plus a 20-byte
                 TCP header), typically sent at the start of a
                 new TCP session.

  2) 576 Bytes   The default IP Maximum Datagram Size (MDS)        11.5%          16.5%
                 packets without fragmentation, including the
                 default TCP Maximum Segment Size (MSS)
                 536-byte packets.

  3) 1500 Bytes  Packets corresponding to the Maximum              10%            37%
                 Transmission Unit (MTU) size of an Ethernet
                 connection.

40-byte packets are generally used for the three-way handshake of TCP connection establishment or for termination. These packets are delivered very often in the Internet and are expected to impose a heavy CPU load on routers. However, since these packets are small, they represent only 3.5% of Internet traffic.

IP packets can logically be up to 65,535 bytes in length. However, there is a long-established rule in RFC 879 [13]: hosts must not send datagrams larger than 576 bytes unless they have specific knowledge that the destination host is prepared to accept larger datagrams. As a result, the default IP Maximum Datagram Size is 576 bytes, which consists of the IP header (20 bytes), the TCP header (20 bytes), and the default TCP Maximum Segment Size (MSS) of 536 bytes. Although these packets occur less often than 40-byte packets, their share of Internet traffic is larger because of their size.

Ethernet is a very popular packet format handled by routers. Table 3-1 shows that packets corresponding to the Maximum Transmission Unit (MTU) size of an Ethernet connection occupy a considerable share of Internet traffic because of their size. Several other packet sizes occurred more frequently than normal, where normal is defined as more than 0.5% of all packets; for example, 52, 1420, 44, 48, 60, 628, 552, 56, and 1408 bytes.

Four sorts of packets are proposed as workloads in this paper. First, I propose three workloads of fixed-size packet streams, shown in Table 3-2. Each workload is formatted as Ethernet packets based on the three sizes of the most frequently occurring packets in the real Internet; in other words, these workloads represent the packets of Table 3-1 encapsulated with an Ethernet header and trailer. The 64-byte, 594-byte, and 1518-byte Ethernet packet workloads are provided for Microengine simulation on the assumption that an IXP1200-based router has 16 x 100 Mbps Ethernet ports. A 64-byte Ethernet packet actually includes 6 bytes of padding in addition to the 14-byte Ethernet header, 20-byte IP header, 20-byte TCP header, and 4-byte Ethernet trailer, because the minimum Ethernet frame length is defined as 64 bytes.
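As a point of reference (my calculation, not from the thesis): on the wire, each Ethernet frame is also preceded by an 8-byte preamble and followed by a 12-byte inter-frame gap, so one 100 Mbps port carries at most

  100 x 10^6 / ((64 + 8 + 12) x 8) = 148,809 minimum-sized frames per second,

or roughly 2.38 Mpps across all 16 ports.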

In addition, I propose a simple mixture of the three packet sizes as the fourth workload. Some router manufacturers commonly use this kind of mixture as a "quick and dirty" approximation of the Internet packet mix. Table 3-3 shows the proposed mixture ratio of packet sizes. I approximated the traffic loads of the 64-byte and 1518-byte packets to the values for the 40-byte and 1500-byte packets shown in Table 3-1, and I regard the 594-byte packets as representative of the other packet sizes between 64 bytes and 1518 bytes. The mixture has an average packet size of 406 bytes. If we assume that the packets of Table 3-1 are simply formatted as Ethernet packets, the average packet size is 420.7 bytes. Therefore, we can expect the proposed workload to correlate very closely with realistic Internet traffic (correlation value: 0.965).

Table 3-2. Workloads of fixed-size packets

  Packet Size    Packet Type Description

  1) 64 Bytes    The minimum-size Ethernet packets, consisting of a 14-byte Ethernet
                 header, 20-byte IP header, 26-byte payload, and 4-byte Ethernet
                 trailer (FCS), and expected to be used for the TCP handshake.

  2) 594 Bytes   Ethernet packets including a 14-byte Ethernet header, 20-byte IP
                 header, 556-byte payload (assuming a 20-byte TCP header plus 536
                 bytes of MSS), and 4-byte Ethernet trailer (FCS).

  3) 1518 Bytes  The maximum-size Ethernet packets, consisting of a 14-byte Ethernet
                 header, 20-byte IP header, 1480-byte payload, and 4-byte Ethernet
                 trailer (FCS).

Table 3-3. Workload of Internet packets mixture

  Packet Size (Bytes)   Proportion of Total   Traffic Load

  64                    50%   (6 parts)       7.881%
  594                   41.7% (5 parts)       60.96%
  1518                  8.3%  (1 part)        31.158%
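For reference, the average size and the traffic-load column follow directly from the 6:5:1 part ratio:

  average packet size = (6 x 64 + 5 x 594 + 1 x 1518) / 12 = 4872 / 12 = 406 bytes
  64-byte load   =  384 / 4872 =  7.881%
  594-byte load  = 2970 / 4872 = 60.96%
  1518-byte load = 1518 / 4872 = 31.158%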

3.3 Pseudo Code of Router Processing

This section presents pseudo code for the loop programs executed by each Microengine in this research's simulation, and makes concrete the router processing explained in Figure 3-1. Although the IXP1200 has six Microengines and allows a total of twenty-four threads to run on them, in the simulation four Microengines are dedicated to receive processing and two Microengines to transmit processing.

Receive processing is defined as lasting from when a packet comes into an input port until the packet is enqueued into a Transmit Queuing Buffer (TQB). Transmit processing lasts from when the packet is dequeued until the packet is sent to an output port. The transmit functions are split across two Microengines; each transmit Microengine contains one scheduler thread and three transmit threads.

Figures 3-2 through 3-4 present the pseudo code for receive and transmit processing. In reality, the code is composed of a variety of function codes; in Appendix A, some substantial segments of pseudo code are presented for reference. The pseudo code is slightly simplified, and context switch descriptions are omitted, for easier comprehension of the router processing.

In the code, an italicized word denotes a register. With no prefix, it represents a general-purpose register (GPR) with context-relative addressing, which belongs to only one thread and cannot be read or written by other threads in a Microengine. A "$" prefix denotes a context-relative SRAM transfer register, and a "$$" prefix a context-relative SDRAM transfer register. A "@" prefix marks an absolute register, which can be read or written by any of the four threads executing in a Microengine, as distinguished from the context-relative registers. In addition, a thread can access the control status registers (CSRs) in the Fast Bus Interface (FBI) unit for packet processing; CSR_ denotes a CSR. The data terminology of the IXP1200 is: quadword = 64 bits, longword = 32 bits, word = 16 bits, and byte = 8 bits. (Note: the Microengine and FBI architectures are described in Section 4.)

3.3.1 Receive Packet Processing

Figure 3-2 shows the pseudo code of the receive processing main loop. This code is assigned to each of 16 threads, and each thread is bound to a specific port number and receive FIFO (RFIFO) element number. Since the IXP1200 has 16 ports and the receive FIFO has 16 elements (each element holds 64 bytes of incoming packet data and 16 bytes of extended data and status), receive threads 0 to 15 are assigned one-to-one to port numbers 0 to 15 and RFIFO element numbers 0 to 15.

First, the Input Scheduler checks the receive ready flags, which indicate that a packet is ready in the external Media Access Control (MAC) device, by reading the REC_RDY register (Note: the ready flags are elaborated in Section 4.2). It then issues a receive request to the FBI by setting rec_req in the REC_REQ register (Note: rec_req should be prepared in the initialization process). Each function uses a semaphore so that only one receive thread reads the receive ready flags, and only one receive thread posts a receive request, at a time.

Once the FBI starts to receive a packet from the MAC device, a start_receive signal is asserted to the receive thread to inform it that the packet data is in the RFIFO element; receive_status then reads the control information from the RCV_CTL register and sets the necessary information into the rec_state and exception registers. RCV_CTL contains start-of-packet (SOP) and end-of-packet (EOP) assertions and error indications from the MAC. The receive thread then allocates a packet descriptor and buffer by an SRAM pop operation. The allocation function needs parameters such as the packet buffer base address (PKBUF_BASE), buffer size (PKBUF_SIZE), descriptor base address (DESC_BASE), and descriptor size (DESC_SIZE). In addition, the thread checks whether the receive port has a failure or error, based on the content of the exception register, and increments the exception counter.

If, from the result of receive_status, the rec_state register contains the SOP bit, the thread reads the MAC packet header from the RFIFO into the SRAM transfer register named $pkt_buf and extracts the 2-byte protocol/length field. parse_packet then classifies the packet into one of three link types, 1) Ethernet, 2) 802.3 with LLC, or 3) 802.3 with LLC/SNAP, based on the protocol/length field, and extracts the ethertype indicating the upper-layer packet type, such as IP or ARP. The pkstate register contains the packet status: the link type and the packet discard decision. A router could typically have multiple different kinds of forwarders corresponding to different services, priorities, and protocols. Although this loop program uses the ethertype only for filtering and handles a single type of forwarder, the ethertype could in general be used for selecting among different forwarders; an example of classifier pseudo code for different forwarders appears in Appendix A.

Once the thread has classified the packet, it passes the ethertype and all header information to etherfilter and filters the packet based on the port configuration. The thread discards the packet according to the result in the pkaction register; otherwise, the forwarder process is invoked.

In the Forwarder, get_IP_header first transfers the IP header into $pkt_buf_ip, and the thread extracts the IP version field, whose position depends on the packet link type. If the packet is IP version 4 without options, the thread directly transfers the remaining payload from the RFIFO into the buffer, then checks the total length, TTL (Time to Live), and checksum in the IP header, and finally writes the result into the exception register. After that, it decrements the TTL by 1 and modifies the checksum value to account for the change in TTL. These functions are packed into xferpayload_&_iphdrchck_&_modify. If the packet is not version 4 or has options, the packet is transferred to the buffer and then enqueued onto the core stack interface queue so that the StrongARM can process it; the thread sets the core stack interface bit in output_intf, which represents the output queue interface.

ip_trie5_lookup performs a dual lookup of a direct entry table and trie block lookups in SRAM. It searches for the best matching prefix for the IP destination address and obtains a route pointer, namely an index to a route entry in SDRAM; the route pointer is stored in the rt_ptr register. The thread then fetches the forwarding information, such as the destination MAC address and output port number, from the SDRAM route entry into the $$dxfer register based on rt_ptr. After that, the IP header is modified on the basis of the forwarding information, and the modified header is written to the buffer, prepended to the payload already in SDRAM.
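The following C fragment sketches the dual-lookup idea (my illustration; the table layout, the 16-bit direct index, and the 4-bit trie stride are assumptions, not the IXP1200 reference design):

    #include <stdint.h>

    #define TRIE_LEAF 0x80000000u    /* high bit marks a route pointer */

    typedef struct {
        uint32_t direct[1 << 16];    /* indexed by the top 16 address bits */
        uint32_t blocks[1 << 20];    /* trie blocks of 16 entries each */
    } fib_t;

    /* Longest-prefix match: one direct lookup, then up to four 4-bit steps. */
    uint32_t fib_lookup(const fib_t *fib, uint32_t ip_dest) {
        uint32_t entry = fib->direct[ip_dest >> 16];
        for (int shift = 12; !(entry & TRIE_LEAF) && shift >= 0; shift -= 4) {
            uint32_t nibble = (ip_dest >> shift) & 0xF;
            entry = fib->blocks[(entry << 4) | nibble];
        }
        return entry & ~TRIE_LEAF;   /* route pointer into the SDRAM route entries */
    }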

If rec_state is not SOP, in other words if continuation data for a packet is arriving in the RFIFO, port_rx_bytecount_extract extracts the previous mpacket's byte count from rec_state, and it is converted to a quadword count. The thread updates the target buffer address and then transfers the 64 bytes of data from the RFIFO to the buffer.

If the packet is EOP and is not being discarded, the descriptor is maintained in SRAM. In addition, the packet is enqueued based on the descriptor SRAM address (desc_addr), the output interface indicating the transmit queue (output_intf), the type of queue, such as linked list or circular (Q_TYPE), the location in Scratchpad of the packet-present indication called Ports with Packets (PWP) (Q_RDY), the base address of the queue (Q_BASE), and the base address of the descriptor buffers (DESC_BASE). In enqueue_packet, the transmit queue is locked just before enqueuing so that other threads cannot access it. The thread then sets the PWP bit (one bit per port) in Scratchpad so that a transmit thread can see that the packet is ready to be sent toward the output port, and finally unlocks the transmit queue. If rec_state contains the discard bit, the thread sets the packet buffer address back to the beginning of the packet in order to discard the packet and reuse the buffer.

RECEIVE_THREAD_MAIN_LOOP:
    // Input Scheduler
    receive_ready_check()
    receive_request(rec_req)
    (rec_state, exception) = receive_status()

mpacket_received:
    if (pkbuf_addr == UNALLOCATED)
        (pkbuf_addr, desc_addr) = pkbuf_allocate(PKBUF_BASE, PKBUF_SIZE, DESC_BASE, DESC_SIZE)
    end if
    port_rx_fail_error_check(exception)

    // Classifier
    if (bit(rec_state, REC_STATE_SOP_BIT))                      // if SOP
        (proto_len, $pkt_buf) = get_mpkt_header()
        (pkstate, ethertype) = parse_packet(proto_len)

        // Filter
        pkaction = etherfilter(ethertype, $pkt_buf)
        if (pkaction == PKT_DENY)
            pk_late_discard(rec_state, rec_req, exception)
        else
            // Forwarder
            $pkt_buf_ip = get_IP_header()
            ip_version = IP_version_check(pkstate, $pkt_buf_ip)
            if (ip_version == IPV4_NO_OPTIONS)
                (exception, ip_dest) = xferpayload_&_iphdrchck_&_modify(pkbuf_addr, rfifo_addr, pkstate)
                if (exception)
                    pk_late_discard(rec_state, rec_req, exception)
                else
                    // Lookup
                    rt_ptr = ip_trie5_lookup(ip_dest, SRAM_ROUTE_LOOKUP_BASE)
                    copy $$dxfer <- DRAM(addr(router_base + rt_ptr), size(3quadwords))
                    write_modified_IP_Ether_header($$dxfer)
                end if
            else                                                // IP with options or frag
                copy RFIFO(addr(rfifo_addr + QWOFFSET0), size(8quadwords))
                    -> DRAM(addr(pkbuf_addr + QWOFFSET0))
                output_intf = CORE_STACK_INTF1 << 3
            end if
        end if
    else                                                        // not SOP
        current_bytecount = port_rx_bytecount_extract(rec_state)
        current_qwcount = current_bytecount >> 3
        pkbuf_addr = pkbuf_addr + 8
        copy RFIFO(addr(rfifo_addr + QWOFFSET0), size(current_qwcount))
            -> DRAM(addr(pkbuf_addr + QWOFFSET0))
    end if                                                      // not SOP
    xbuf_free($pkt_buf)

    if (bit(rec_state, REC_STATE_EOP_BIT))                      // if EOP
        if (!bit(rec_state, REC_STATE_DISCARD_BIT))
            $desc_buf = update_descriptor()
            copy $desc_buf -> SRAM(addr(desc_addr + LWOFFSET0), size(2lwords))
            // Enqueue
            enqueue_packet(desc_addr, output_intf, Q_TYPE, Q_RDY, Q_BASE, DESC_BASE)
            pkbuf_addr = UNALLOCATED
            xbuf_free($desc_buf)
        else
            pkbuf_addr = buf_dram_addr_from_sram_addr(desc_addr, PKBUF_BASE, PKBUF_SIZE,
                DESC_BASE, DESC_SIZE)
        end if
        rec_state = 0
    end if
    xbuf_free($$dxfer)
    goto RECEIVE_THREAD_MAIN_LOOP

Figure 3-2. Pseudo Code of Receive Thread Main Loop

3.3.2 Transmit Packet Processing

As described above, transmit packet processing consists of a transmit scheduler and three transmit threads that move packets from the transmit buffer to the transmit FIFO; each is allocated to one of the four threads in a transmit Microengine. Figure 3-3 presents the pseudo code of the transmit scheduler main loop together with its segmented pseudo code. First, the scheduler reads Ports with Packets (PWP) from the Scratchpad address specified by pwp_addr into the $pwp SRAM transfer register for polling. The PWP information is aggregated with the value of the @local_pwp register, which holds the current local status of PWP and is modified by all transmit threads.

The scheduler then creates three transmit assignments with tx_assign. @assign# denotes an absolute GPR used as a mailbox to hold the transmit request for a transmit thread; the number "#" corresponds to the number of the transmit thread (1 to 3) in a Microengine. The scheduler checks each port for enqueued data sequentially. If any outbound packets are queued for transmission on a port, a new assignment is set into @assign# and the target port number is incremented for the next check. However, before the transmit assignment is updated, the scheduler has to wait on a semaphore until the transmit thread becomes idle; the valid bit (bit 31) of @assign# is used as this semaphore, since the register is shared with the transmit thread. If the next port does not have any queued packets, the scheduler just sets the skip bit and updates the target port number.

// Scheduler
TRANSMIT_SCHEDULER_MAIN_LOOP:
    copy $pwp <- Scratch(addr(pwp_addr), size(1bit))
    aggregate_pwp = $pwp | @local_pwp
    (target_port, @assign1) = tx_assign(target_port, @assign1, aggregate_pwp, SKIP_BIT, PORT_INCR)
    (target_port, @assign2) = tx_assign(target_port, @assign2, aggregate_pwp, SKIP_BIT, PORT_INCR)
    (target_port, @assign3) = tx_assign(target_port, @assign3, aggregate_pwp, SKIP_BIT, PORT_INCR)
    goto TRANSMIT_SCHEDULER_MAIN_LOOP

//******************************** Segmented Pseudo code ***********************************
//************** Format of Transmit Assignment ************
//   Bits:  31     30:9   8     7:4    3:0
//          Valid  RES    Skip  Port   RES
//   RES: Reserved
//*********************************************************
tx_assign(target_port, @assign#, aggregate_pwp, SKIP_BIT, PORT_INCR)
{
    new_target_port = target_port + PORT_INCR
    if (((aggregate_pwp >> (target_port & 0x1F)) & 1) > 0)   // PWP bit set for this port?
        new_assignment = target_port
        goto update_assignment
    else
        new_assignment = target_port | (1 << SKIP_BIT)
    end if
update_assignment:
    sem_wait(@assign#)              // wait until the semaphore (bit 31) is set
    @assign# = new_assignment
    target_port = new_target_port
}

sem_wait(@assign#)
{
begin:
    if (@assign# < 0)               // watch bit 31 (semaphore); if set, then exit
        goto end
    else
        goto begin
end:
}
//******************************************************************************************

Figure 3-3. Pseudo Code of Transmit Scheduler Main Loop


Figure 3-4 shows the pseudo code of the transmit thread main loop. When the last packet has been transferred into the TFIFO, the transmit assignment (@assign#) is inverted, which sets the semaphore and allows the scheduler to write the next transmit assignment into it. tx_assignment_read waits until the scheduler sets the next transmit assignment for the thread, and flips the semaphore bit at the same time. Once the new assignment is set, the transmit thread reads it and extracts the port number (port), skip flag (skip_flag), TFIFO element (tfifo_entry), and transmit queue offset (q_offset). In the IXP1200 there are 16 TFIFO elements, each specific to an output port; each contains 64 bytes for outbound packet data and 16 bytes for the control and prepend fields. In the code, the TFIFO element number is the same as the port number. Since the transmit threads are spread across two Microengines, one Microengine takes the even TFIFO elements (8 in total) and the other takes the odd ones (8 in total).

If skip_flag is not set, the thread extracts the port information, namely ele_remaining, buf_offset, bank, and last_mpkt_byte_cnt, from the global port-in-progress registers (@port_inprog0-7) associated with each port. ele_remaining indicates the number of remaining elements in the outbound MAC packet. buf_offset denotes the offset from the top of the transmit packet buffer to the start of the valid data. bank indicates the SDRAM bank the packet is in. last_mpkt_byte_cnt denotes the byte enables for the last mpacket. If the previous packet has been completely transferred to the TFIFO, that is, if ele_remaining equals 0, the thread locks the SRAM transmit queue so that other threads cannot access it, reads the two-longword queue descriptor for the next packet from the queue, and gets the head and tail pointers of the linked list that links the buffers. In addition, it reads the two-longword packet link list and extracts the next ele_remaining, last_mpkt_byte_cnt, bank, and buf_offset for that packet. The thread then updates the queue descriptor. First it decrements the packet count, which shows the number of packets in the transmit queue, by 1 and places it in the SRAM transfer register ($q_desc1). If the packet count is more than 0, the thread changes the existing tail pointer to the new head pointer and merges it with the new tail pointer; the merged head and tail pointers are placed in the SRAM transfer register ($q_desc0). Finally, those descriptor values ($q_desc0 and $q_desc1) are written back to the transmit queue descriptor and the queue is unlocked.

If ele_remaining equals 1, only a single 64-byte element is queued for the packet, and tx_last_mpkt_xfr moves the data directly from SDRAM to the TFIFO. tx_status_set prepares the control information, containing parameters such as the port number and the EOP and SOP flags, in the $tfifo_ctl_wd0 register, and tfifo_validate writes it to the control field of the TFIFO. In addition, tfifo_validate reads the transmit pointer and transmit ready flags from the FBI until the transmit pointer is equal to, or one less than, the current TFIFO element. The transmit pointer is continuously maintained by the Transmit State Machine (TSM) in the FBI and points to the TFIFO element that the TSM expects to send next; the transmit ready flags indicate that the ports will accept data. If the transmit port is ready, the transmit thread sets "Pass" in the return_status register and sets the valid flag in the TFIFO. When the SDRAM transfer is complete, the SDRAM controller also sets a valid bit in the TFIFO control field. When both valid bits are set, the TSM commences transferring the data from the TFIFO to the MAC device. If the validation fails, the thread sets the skip bit in the TFIFO control field.

If return_status shows "Pass", the thread releases the packet descriptor and buffer. Otherwise, the thread saves the remaining elements, last mpacket byte count, and buffer offset in @port_inprog0-7, and tx_portvect_modify sets the @local_pwp bit for the port.

Whether the mpacket is in the SOP position, in the EOP position, or between them, the thread performs the same sequence: data transfer, control field setup, data validation, and the status check. If the ele_remaining restored from @port_inprog is 0 and the value read from the packet link list is 2 or more, the mpacket is SOP but not EOP; if the restored value is 0 and the new value is 1, the mpacket is both SOP and EOP. If the restored value is 1, the mpacket is EOP but not SOP, and any other non-zero value means the mpacket lies between SOP and EOP. Only when the EOP is transferred to the TFIFO is the corresponding @local_pwp bit cleared.

If the transmit scheduler sets the skip bit in the transmit assignment, the transmit thread is responsible for ensuring that the TSM skips over the assigned TFIFO element. To do this, the transmit thread issues a null SDRAM transfer to force the SDRAM controller to set the TFIFO valid bit; it then sets the skip bit in the TFIFO control field and sets its valid bit. This procedure makes the TSM skip the data in this TFIFO element and go on to the next one.

TRANSMIT_THREAD_MAIN_LOOP:

@assign# = ~(@assign#)

(q_offset,port,skip_flag,tfifo_entry) = tx_assignment_read()

process_assignment:

Page 36: Workload Characterization and Performance for a Network ...palms.ee.princeton.edu/PALMSopen/miyazaki02workload.pdf · time-to-market frames. Network processors (NPs) are now very

28

if (skip_flag != SKIP_BIT_SET)

(ele_remaining,buf_offset,bank,last_mpkt_byte_cnt) = tx_portinfo_restore(port,@port_inprog0-7 )

if (ele_remaining == 0) // if no elements left in last packet

($q_desc0, $q_desc1, $pkt_link0, $pkt_link1, tail_ptr, ele_remaining, bank, buf_offset,

last_mpkt_byte_cnt) = tx_pktlinklist_read(q_desc_base, q_offset, buf_desc_base)

tx_pktlinklist_update($q_desc0, $q_desc1, q_desc_base, tail_ptr, q_offset, $pkt_link0, port)

if (ele_remaining == 1) // if sop and eop

tx_last_mpkt_xfr(bank, buf_offset, last_mpkt_byte_cnt, tfifo_entry, pkt_buff_base)

$tfifo_ctl_wd0 = tx_status_set(last_mpkt_byte_cnt, EOP_AND_SOP, port)

return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)

if (return_status == PASS) //if good validate

tx_ll_buf_free($q_desc0, bank, buf_desc_base, DESC_SIZE, 16)

else //could not validate

(@port_inprog0-7) = tx_portinfo_sop_save()

tx_portvect_modify(@local_pwp , port, 1)

end if

else // sop, but not eop

tx_portvect_modify[@local_pwp , port, 1]

tx_mpkt_xfr(bank, buf_offset, tfifo_entry, pkt_buffer_base, 8) // 64 bytes transfer

$tfifo_ctl_wd0 = tx_status_set(const_0, 0xfd, port)

return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)

if (return_status == PASS)

(@port_inprog0-7) = tx_portinfo_save()

else

(@port_inprog0-7) = tx_portinfo_save_no_decr()

end if

end if

else // not sop

if (ele_remaining == 1) // if NOT SOP, but EOP

tx_last_mpkt_xfr(bank, buf_offset, last_mpkt_byte_cnt, tfifo_entry, pkt_buffer_base)

$tfifo_ctl_wd0 = tx_status_set(last_mpkt_byte_cnt, 0x2, port)

return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)

if (return_status == PASS)

tx_ll_buf_free(buf_offset, bit20on, bank, buf_desc_base, DESC_SIZE, 3)

Page 37: Workload Characterization and Performance for a Network ...palms.ee.princeton.edu/PALMSopen/miyazaki02workload.pdf · time-to-market frames. Network processors (NPs) are now very

29

tx_portinfo_update() //clear local pwp bit for this port

tx_portvect_modify(@local_pwp , port, 0) // clear bit number "port"

end if

else // NOT SOP and NOT EOP

tx_mpkt_xfr(bank, buf_offset, tfifo_entry, pkt_buffer_base, 8)

$tfifo_ctl_wd0 = tx_status_set(const_0, 0xfc, port) //no eop, no sop

return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)

if (return_status == PASS)

tx_portinfo_update()

end if

end if

end if

transmit_done:

else // given a "skip tfifo element" assignment

tfifo_element_skip_nordy(tfifo_entry, pkt_buffer_base)

end if

goto TRANSMIT_THREAD_MAIN_LOOP

Figure 3-4. Pseudo Code of Transmit Thread Main Loop


4. Network Processor Architecture

The network processor typically covers the fundamental functions of routers and accelerates them through its hardware architecture. It is programmable, enabling easier migration to new protocols and technologies without requiring new ASICs. Therefore, a network processor generally incorporates multiple general-purpose processors and special hardware-assist engines, for example for hashing, tree lookup, checksum computation, filtering, and security classification. Figure 4-1 presents an overview of the architecture of the Intel IXP1200 network processor.

The IXP1200 is composed of a StrongARM microprocessor, six independent 32-bit RISC engines (Microengines) with hardware multithreading support, standard memory interfaces, and high-speed bus interfaces to Media Access Control (MAC) layer devices and PCI. It can replace the host processor and all of the ASICs in an ASIC-based router system. The programmable Microengines make it easy to add new functionality through software updates instead of hardware modifications. The high-speed bus interface for packet transfer is called the Internet Exchange (IX) bus interface and is provided by the Fast Bus Interface (FBI) unit. The FBI unit also includes scratchpad RAM and a hash unit to support packet processing. In addition, the IXP1200 connects to SDRAM, which stores packets coming from the MAC devices, and to SRAM, which stores heavily used data structures such as the FIB lookup tables. Requests to access SDRAM or SRAM are queued in the SDRAM and SRAM units respectively by executing a specific reference instruction. Each Microengine can directly access the SDRAM unit, SRAM unit, and FBI unit via two separate 32-bit internal buses. Alternatively, the PCI bus can be used to connect external MAC devices instead of the IX bus. The following sections describe the Microengine and the FBI unit in more detail.

[Figure: block diagram of the IXP1200, showing the Intel StrongARM SA-1 core (16-Kbyte I-cache, 8-Kbyte D-cache, mini-Dcache, write and read buffers), JTAG, PCI unit, UART/timers/GPIO/RTC, the SRAM and SDRAM units, the FBI unit with 4-Kbyte scratchpad memory, hash unit, and IX Bus interface, and six Microengines connected by 32-bit and 64-bit internal buses.]

Figure 4-1. Architecture of the Intel IXP1200

4.1 Microengine Architecture

The six Microengines can run a total of 24 threads associated with the fast path processing of routers without support from the StrongARM core. Each Microengine has four independent program counters, hardware support for very low overhead context switching (as little as one cycle), a programmable 4-Kbyte instruction store, 128 32-bit general purpose registers, 128 32-bit transfer registers, and an ALU and shifter capable of performing an ALU and a shift operation in a single cycle. The instruction set was specifically designed for networking applications. Figure 4-2 depicts an overview of a Microengine.

Figure 4-2. Microengine Architecture

The 128 general purpose registers can be addressed by using relative or

absolute addressing. Relative addressing divides the registers into the four threads,


while absolute addressing allows a register to be shared among all four threads (under relative addressing, each thread sees 32 of the 128 registers: 16 from each bank). The registers are single-ported and are divided into two banks so that this does not impair performance.

The transfer registers are used to store data that has been read in from memory

or a device on the IX Bus and data that is going to be written out to memory or a

device on the IX bus. The transfer registers are divided into multiple banks as well.

There are 32 SRAM read, 32 SRAM write, 32 SDRAM read, and 32 SDRAM write

registers in a Microengine. A Microengine can issue a data transfer between

multiple transfer registers and the SRAM or SDRAM with a single command.

The ALU and shifter perform standard arithmetic and logic functions, plus some unique operations that are useful in packet processing. Because of this, the Microengine can accomplish in a single instruction sophisticated packet processing that would take several instructions on a general-purpose RISC processor. Each Microengine contains a set of control and status registers, which the StrongARM core uses to program, control, and debug the Microengines. The instructions used in the Microengine can be classified into five categories: 1) arithmetic, rotate, and shift instructions, 2) branch and jump instructions, 3) reference instructions, 4) local register instructions, and 5) miscellaneous instructions. Appendix B describes the instruction set of a Microengine in detail.

The Microengine maintains four program counters, only one of which may be active at any given time. This enables the Microengine to keep track of four separate threads, or executing processes. Threads can use and share the same code in the program store, or they can have separate code, or some of each. The register groups are all broken into four separate sets, so that each thread can easily maintain its own context. A running thread must voluntarily suspend itself for another thread to start; this is called cooperative multitasking. A thread will normally swap itself out while it is waiting for something external to occur, for example for a read from SDRAM to complete. To accomplish this, the programmer just indicates what condition the thread is waiting for and tells the processor to swap. The Microengine controller then moves on to the next thread that is ready to run. That thread will eventually swap out as well, and if no other thread is ready to run, the first thread resumes. In this way, the Microengine can hide the long latencies caused by referencing off-chip memory.
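To make this scheduling discipline concrete, the following sketch models the cooperative hand-off in plain C: four threads, each with its own program counter, where the running thread marks itself as waiting and a round-robin arbiter picks the next ready thread. This is only a toy model of the behavior described above, not the actual arbiter logic, and all names in it are invented for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    enum { NTHREADS = 4 };
    static bool ready[NTHREADS] = { true, true, true, true };
    static int  pc[NTHREADS];              /* one program counter per thread */

    /* Round-robin arbiter: pick the next thread that is ready to run. */
    static int next_ready(int cur)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (cur + i) % NTHREADS;
            if (ready[t])
                return t;
        }
        return cur;  /* no other thread ready: the current thread resumes */
    }

    int main(void)
    {
        int cur = 0;
        for (int step = 0; step < 8; step++) {
            pc[cur]++;                           /* run until a voluntary swap   */
            ready[cur] = false;                  /* wait on an external event    */
            ready[(cur + 2) % NTHREADS] = true;  /* pretend some event completed */
            cur = next_ready(cur);
            printf("step %d: running thread %d\n", step, cur);
        }
        return 0;
    }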

4.2 FBI Unit Architecture and IX Bus Interface

The FBI unit contains the receive and transmit FIFO buffers (RFIFO and TFIFO), 4 Kbytes of scratchpad RAM, push and pull engines with an 8-entry pull command queue and an 8-entry push command queue, and a 48- or 64-bit hardware hash unit with an 8-entry hash command queue. In fact, "FIFO" is something of a misnomer, because the RFIFO and TFIFO are memories that can be accessed in any order; each acts as a circular buffer traversed by a pointer. Each FIFO is a collection of 16 elements, and each element holds 10 quadwords (i.e., 10 x 64 bits). An RFIFO element is composed of 8 quadwords (64 bytes) of data, 1 quadword of status, and 1 quadword for an extended data field. A TFIFO element is made of 8 quadwords (64 bytes) of data, 1 quadword of control, and 1 quadword for a prepend field. In addition, the FBI unit controls the 64-bit IX Bus interface, which consists of transmit and receive state machines that operate independently and in parallel, a ready bus sequencer, and the IX Bus arbiter. The FBI unit also contains control and status registers (CSRs) accessible by both the Microengines and the StrongARM core. Figure 4-3 presents the FBI unit architecture.

Figure 4-3. FBI Unit Architecture
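As a concrete picture of the element layout just described, the following C sketch lays out RFIFO and TFIFO elements as structs. The field names are invented for illustration; only the sizes come from the text.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t data[8];    /* 8 quadwords (64 bytes) of packet data */
        uint64_t status;     /* 1 quadword of status                  */
        uint64_t extended;   /* 1 quadword extended data field        */
    } rfifo_element;

    typedef struct {
        uint64_t data[8];    /* 8 quadwords (64 bytes) of packet data */
        uint64_t control;    /* 1 quadword of control                 */
        uint64_t prepend;    /* 1 quadword prepend field              */
    } tfifo_element;

    /* Each FIFO is a collection of 16 such elements used as a circular
     * buffer traversed by a pointer. */
    static rfifo_element rfifo[16];
    static tfifo_element tfifo[16];

    int main(void)
    {
        printf("element: %zu bytes, RFIFO: %zu bytes, TFIFO: %zu bytes\n",
               sizeof(rfifo_element), sizeof(rfifo), sizeof(tfifo));
        return 0;
    }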

Figure 4-4 illustrates the relationship between the Ready Bus and a MAC device. A typical MAC device (such as the Intel 21440 octal 10/100 Mbps Ethernet controller) provides transmit and receive ready flags that indicate whether the amount of data in a FIFO has reached a certain threshold level. The Ready Bus Sequencer in the IXP1200 periodically polls the receive and transmit FIFO ready


flags and places them into the FBI registers (RCV_RDY and XMIT_RDY). Software then reads those flags, for example as a semaphore, and determines whether the corresponding port is ready to receive or transmit. In this way, the FBI unit manages the sharing of multiple ports in cooperation with the MAC devices.

Figure 4-4. Ready Bus and Ready Flags
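A minimal sketch of how software might consume these flags is shown below in C. RCV_RDY is modeled as a plain variable and the port count as 16; on the real chip this is a CSR read, so the names and the helper function are illustrative only.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t RCV_RDY;  /* one ready bit per port, set by the sequencer */

    /* Walk the receive-ready bits and service each port whose receive
     * FIFO has reached its threshold. */
    static void service_ready_ports(void)
    {
        for (int port = 0; port < 16; port++)
            if (RCV_RDY & (1u << port))
                printf("port %d has enough data to start a receive\n", port);
    }

    int main(void)
    {
        RCV_RDY = (1u << 3) | (1u << 9);  /* pretend ports 3 and 9 are ready */
        service_ready_ports();
        return 0;
    }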

4.3 Microengine Pipelining

In the IXP1200, each Microengine executes a five-stage pipeline, shown in Figure 4-5. The pipeline is composed of P0 = Instruction Fetch (F), P1 = Decode (D), P2 = Read operands (R), P3 = Execute (E), and P4 = Write (W), and each stage takes one Microengine cycle. In the F-stage, four program counters (PCs) are multiplexed to operate the four threads in a Microengine. The context arbiter, which receives event signals, determines the next instruction address, and the next instruction is fetched from the 4-Kbyte microprogram store. The D-stage decodes the fetched instruction and passes the immediate part to the next stage if necessary. In the R-stage, two operands are read from the GPRs, the SDRAM transfer registers (SDRAM xfers), the SRAM transfer registers (SRAM xfers), Pipe Latch D/R, or Pipe Latch E/W. The E-stage performs the ALU or shift operation; the four threads share the ALU and shifter. In the W-stage, the result of the ALU/shift operation is written into the SDRAM xfers, SRAM xfers, or GPRs.

Figure 4-5. Microengine Pipeline

In this microarchitecture, there is basically no structural hazard, because a Microengine does not have the data memory access stage normally included in general-purpose RISC CPUs, and therefore the F-stage does not share memory with any other stage; a Microengine accesses the microprogram store to read an instruction only in the F-stage. Regarding data hazards, this pipeline has no Write After Read (WAR) or Write After Write (WAW) hazards, since it is a five-stage in-order pipeline in which the read and write positions are fixed. The architecture does adopt data forwarding from E-stage to E-stage and from E-stage to R-stage so that it can avoid Read After Write (RAW) hazards, as can be seen in Figure 4-5. There are, however, several control hazards caused by branch instructions and context switches (not shown in Figure 4-5), and their influence depends on the type of instruction. Section 4.5 describes the branch and context switch decision mechanisms and shows how they generate aborted cycles.

4.4 Memory Access

One of the unique characteristics of the Microengine is the way it accesses memories such as SDRAM and SRAM. When a Microengine transfers data to or from a memory, the data goes through either the SDRAM or the SRAM transfer registers (xfers) rather than being accessed directly. Figure 4-6 depicts the memory access flow for SDRAM and SRAM. For an SDRAM write operation, a Microengine first stores the write data into the SDRAM transfer registers and then issues an SDRAM write command to the SDRAM unit. The memory controller, namely the SDRAM unit, then executes a DMA-like transfer from the appropriate SDRAM transfer registers to the SDRAM. For an SDRAM read operation, a Microengine first issues the read request to the SDRAM unit. The SDRAM unit then pulls the data out of the SDRAM and deposits it in the specified SDRAM transfer registers. When the Microengine is notified that the data has been written into the transfer registers, it can read the data out of them. Similarly, the SRAM unit handles transfers between the SRAM and the SRAM transfer registers. In fact, the SRAM transfer registers also cover transfers for other resources such as the RFIFO, TFIFO, CSRs, hash unit, and scratchpad memory. As a result, through the combination of these transfer registers, the other per-context resources (i.e., PCs and GPRs), and the context switch arbiter, a context switch can be realized with little overhead and can hide many stall cycles. Strictly speaking, the maximum overhead is one cycle; the reason why a context switch has this one-cycle overhead is addressed in Section 4.5.

Figure 4-6. Memory Access flow
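To make the read flow concrete, here is a minimal C model of the three steps just described. The stub functions and the fake memory arrays are invented for illustration; on the real chip each step is a single microcode instruction rather than a C call.

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t fake_sdram[1024];    /* stand-in for external SDRAM   */
    static uint64_t sdram_xfer_reg[32];  /* SDRAM read transfer registers */

    /* Step 1: queue a read reference; the SDRAM unit later deposits the
     * data into the specified transfer register. */
    static void sdram_issue_read(uint32_t qw_addr, int xfer)
    {
        sdram_xfer_reg[xfer] = fake_sdram[qw_addr];
    }

    /* Step 2: on real hardware the thread voluntarily swaps out here and
     * another ready thread runs until the SDRAM unit signals completion;
     * this stub just returns. */
    static void ctx_swap_until_signal(void) { }

    int main(void)
    {
        fake_sdram[3] = 0x0123456789abcdefULL;
        sdram_issue_read(3, 0);       /* issue the read reference           */
        ctx_swap_until_signal();      /* hide the latency in another thread */
        printf("xfer0 = %016llx\n",   /* step 3: read the transfer register */
               (unsigned long long)sdram_xfer_reg[0]);
        return 0;
    }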


4.5 Branch and Context Switch Mechanism

This section describes the branch decision mechanism and the branch penalties resulting from control hazards in the execution pipeline. Since context switches also cause control hazards and behave analogously to branch instructions, both are explained together here. In addition, three supporting techniques for avoiding branch penalties in the Microengine are introduced: deferred branches, setting condition codes early, and branch guessing.

First of all, the branch instructions can be categorized into the three classes shown in Table 4-1. Only class 1 actually includes context switch instructions. The branch decision is made in either P1 (D-stage), P2 (R-stage), or P3 (E-stage) depending on the class. I explain how each class performs branching below.

Table 4-1. Instructions Categorized by Class

Class 3                Class 2     Class 1
br_bclr, br_bset       br=0        br           sdram
br=byte, br!=byte      br!=0       br=ctx       sram
jump                   br>0        br!=ctx      hash1_48
rtn                    br>=0       ctx_arb      hash2_48
br_!signal             br<0        csr          hash3_48
br_inp_state           br<=0       r_fifo_rd    hash1_64
                       br=cout     t_fifo_wr    hash2_64
                       br!=cout    scratch      hash3_64

Note: In the printed table the context switch instructions are shown in blue; these are ctx_arb and the reference instructions (csr, r_fifo_rd, t_fifo_wr, scratch, sdram, sram, and the hash instructions).


4.5.1 Class3 Instructions

Class 3 instructions always make the branch decision in the P3 E-stage. Figure 4-7 shows an example of the pipeline operation of a class 3 instruction. Since the branch on bit clear (br_bclr) instruction takes a branch on the basis of whether the specified bit of a register is clear or set, an alu operation must execute and set the condition code before br_bclr executes. (Class 3 actually includes both instructions that require the condition code to be set and instructions that do not.) If necessary, the condition code is set in the E-stage of another instruction, and the result is passed to the following E-stage for the branch instruction. The condition code is latched, and the branch instruction then determines whether the branch should be taken. If not taken, the pipeline proceeds as a normal pipeline stream. If taken, however, the three instructions after the branch instruction are squashed and aborted, and the pipeline restarts from the target instruction. In other words, the Microengine has a control hazard, and class 3 instructions normally incur three branch penalty cycles.

Figure 4-7. Branch pipeline example with class3 instruction


4.5.2 Class2 Instructions

Class 2 instructions make the branch decision in either the D-stage or the R-stage, depending on when the condition code is set. Two possible cases are shown in Figures 4-8 and 4-9. The condition code is again generated by an alu instruction in the E-stage, but for class 2 instructions the condition can be passed directly into the branch decision stage (either the R-stage or the D-stage) without a pipeline latch. If the branch instruction executes right after the instruction that sets the condition code, the branch decision is made in the R-stage (Figure 4-8). If not taken, the pipeline performs as normal; if taken, the two instructions after the branch decision are aborted and the target instruction is fetched, so there are two branch penalty cycles in this case. If an instruction is inserted between the condition-code-setting instruction and the branch instruction, the earlier branch decision causes only one instruction to be aborted if the branch is taken (Figure 4-9); if not taken, the pipeline just goes straight through.

Figure 4-8. Branch pipeline example with class2 instruction (case1)


Figure 4-9. Branch pipeline example with class2 instruction (case2)

4.5.3 Class1 Instructions

Class 1 instructions fall into two groups: branch instructions and context switch instructions. The context switch instructions change the execution context as well as branch to the next instruction to be executed in the other context. For class 1 instructions, the branch decision is made in the D-stage, after the initial decoding of the instruction (Figure 4-10); once the instruction is decoded, all the information needed for the branch decision is available. The context switch decision is made similarly in the D-stage, and the result is sent to the context arbiter. If the branch is not taken or the context is not switched, the pipeline proceeds straight through without squashing any instruction. If taken or switched, there is one penalty cycle, because the decision cannot be made before the D-stage.

Figure 4-10. Branch pipeline example with class1 instruction


4.5.4 Solutions for branch penalties

There are several ways to reduce branch penalties. First, the branch instructions of the Microengines support deferred branches, which use the "defer" optional token within an instruction. Although software programmers can set the option manually for each branch instruction, the IXP1200 Assembler supports an optimization that performs deferred branch optimization automatically. The deferred branch option can reduce or eliminate aborted instructions in the execution pipeline: in a deferred branch, one or more instructions following the branch decision are allowed to execute before the branch takes effect. Figure 4-11 presents the pipeline in the case where a deferred branch is taken. Since the instruction is class 3, the defer token can fill up to three instruction slots before the pipeline branches, hiding the branch latency. The number of instructions that can be deferred depends on the instruction class. This option can also be applied to context switch instructions. As a result of using deferred branches, computation efficiency can be improved considerably.

Figure 4-11. Branch pipeline example with deferred branch instruction


Second, for class 2 instructions, setting condition codes early eliminates one aborted instruction, because the branch decision can be made one cycle earlier. This is easily seen by comparing Figure 4-8 with Figure 4-9.

Third, the IXP1200 supports guess branches, which prefetch an instruction from the branch-taken path before the actual branch decision is made. This option is provided by the guess branch optional token within an instruction, in the same way as the deferred branch. Table 4-2 shows which instructions support the guess branch optional token. As Figure 4-12 shows, if the guess branch is taken, one aborted cycle is generated; if it is not taken, however, two aborted cycles are caused by the branch misprediction. A guess branch can also be combined with a deferred branch to hide the branch penalty of the taken path, as depicted in Figure 4-13.

Table 4-2. Guess Branch Instructions

Supports guess_branch:         br_bset, br_bclr, br_inp_state, br_!signal,
                               br=cout, br!=cout, br>0, br>=0, br<0, br<=0,
                               br=0, br!=0
Does not support guess_branch: br, br=ctx, br!=ctx, br=byte, br!=byte,
                               jump, rtn


Figure 4-12. Branch pipeline example with guess instruction

Figure 4-13. Branch pipeline example with guess and deferred branch options


5. IXP1200 Network Processor Evaluation

5.1 Methodology

What I study in this paper are the computer architectural properties of an emerging, well-known network processor, Intel's IXP1200, with a particular focus on packet processing on the forwarding path. To study these architectural characteristics, I look at Microengine metrics such as the instruction mix, latencies in accessing memory, the "executing, aborted, stalled, and idle" ratio, cycles per instruction (CPI), and throughput, on the basis of four workloads: fixed-size packet workloads of 64, 594, and 1518 bytes, and a mixture packet workload. I chose to run my evaluation on the simulator shipped with the IXP1200 development environment, because the actual hardware does not provide any fine-grained performance information. The simulator also guarantees that there are always packets available at each input port.

The simulator environment consists of 1) a GUI interface to all the Microengine tools, called the Workbench GUI, 2) the Microcode assembler, 3) the Microcode linker, 4) a debug and simulation engine called the Transactor, which includes the IXP1200 architectural model and memory, and 5) an API, called the Simulation Extension, that enables a C, C++, or Verilog model of an IX bus device (i.e., a MAC device) to communicate with the Transactor and simulate the interaction between them. In addition, I employed a reference program written in the Microcode Assembler language that implements forwarding on the Microengines. The reference program, named L2L3fwd16, is provided with the


simulator environment by Intel. Refer to the pseudo code of the receive and transmit thread main loops in Section 3.3. The simulation programs assume a router with 16 x 100 Mbps Ethernet ports and include router processing such as input scheduling, filtering, forwarding, IP lookup, and output scheduling. In the program, receive threads are assigned to Microengines 0-3, so sixteen threads run independently, one per port. Transmit threads are assigned to Microengines 4-5; since one thread per Microengine is assigned to the output scheduler, the other three threads per Microengine run independent transmit tasks. In the simulation, the Microengines operated at 232 MHz and the IX bus transferred packets at 104 MHz. Two 8-port 100 Mbps MAC devices (Intel IXF440) were connected to the IXP1200, and the bus frequency of the SRAM and SDRAM was 116 MHz.

Regarding workloads, I set each workload up and sent packets to the input ports randomly. During each experimental run, the simulator had to forward 3000 packets. I chose this number because of the long running time of the simulator (about one hour per run under Windows 2000 on a 1.6 GHz Pentium IV). Even though this number of packets sounds small, I am confident that the results for more than 3000 packets would look similar to those presented in this paper. The evaluation results are gathered both from statistics given directly by the simulator and from scripts written to process the simulator's output.


5.2 Instruction Mix

The six Microengines are simulated on the basis of the four workloads: 64-byte, 594-byte, 1518-byte, and mixed Ethernet packets including IP information. Figures 5-1 to 5-3 show the distribution of the Microengine instructions over the five categories, called the instruction mix: 1) arithmetic, rotate, and shift instructions, including move operations and condition code setting for branching; 2) branch and jump instructions; 3) reference instructions, which support data transfer between memories such as SRAM, SDRAM, scratchpad RAM, RFIFO, TFIFO, and even the CSRs (Control Status Registers), and the SRAM/SDRAM transfer registers in the Microengines; 4) local register instructions, including loads of immediate data with or without a shift operation; and 5) miscellaneous instructions, which include nop, hashing, and context swap-out operations. In the simulation, Microengines 0, 1, 2, and 3 are dedicated to receive operations, and Microengines 4 and 5 execute transmit operations. The raw instruction mix data are shown as spreadsheets in Appendix C.

Figure 5-1 shows that "Arithmetic, Rotate, and Shift Instructions" and "Branch and Jump Instructions" account for a high proportion of the receive instruction mix. They are mainly used for packet parsing, header checking, and header modification. As the relative frequency of header checking and modification rises (i.e., with smaller packet sizes), alu and alu_shf instructions are used in more cases. The ld_field and ld_field_w_clr operations (load bytes into a specific field), also driven by header modification, strongly affect the proportion of "Local Register Instructions". In addition, the ratio of "Local Register Instructions" is associated with the semaphore operations that receive threads use to control utilization of the buses and internal queues. Since 64-byte packets are smaller than the others, the frequency of blocking and releasing resources becomes higher; in fact, immed instructions (load immediate word and sign extend or zero fill with shift) are regularly used to set and clear the semaphore control values in GPRs. The ratio of "Reference Instructions" also increases slightly for smaller packets, because the frequency of reading the Ethernet header from the RFIFO into transfer registers goes up with repeated header checking. Finally, the instruction mix of the mixed packets appears to be dominated by the maximum packet size: its proportions are analogous to the instruction mix of the 1518-byte packets, although the average packet size is 406 bytes.

[Figure: stacked bar chart of the receive instruction ratio per packet type (64B, 594B, 1518B, mixture), broken into Arithmetic/Rotate/Shift, Branch and Jump, Reference, Local Register, and Miscellaneous instructions.]

Figure 5-1. Instruction Mix for Receiving Packets


The transmit instruction mix is presented in Figure 5-2. The ratios vary less than in the receive instruction mix; in particular, the three workloads other than 64 bytes are very similar. Still, some points stand out. The ratio of "Reference Instructions" in the 64-byte workload is higher than in the others because accesses to the scratchpad increase: a transmit thread reads "Ports with Packets" (PWP) from the scratchpad, which a receive thread sets after enqueuing a packet in SRAM, and the frequency of setting PWP depends on the packet size of the workload. In addition, since a thread handling small packets has to update the transmit queue descriptor frequently using ld_field_w_clr, the ratio of "Local Register Instructions" increases slightly. The overall distribution of instructions for router processing is shown in Figure 5-3.

[Figure: stacked bar chart of the transmit instruction ratio per packet type (64B, 594B, 1518B, mixture) over the same five instruction categories.]

Figure 5-2. Instruction Mix for Transmitting Packets


[Figure: stacked bar chart of the overall instruction ratio per packet type (64B, 594B, 1518B, mixture) over the same five instruction categories.]

Figure 5-3. Instruction Mix for Overall Processing

5.3 Latency

In general, accessing external resources such as SDRAM and SRAM often causes a serious deterioration in processor performance because of the large latency involved. This section addresses the IXP1200's latencies for SDRAM and SRAM. The cumulative distributions of the latencies are presented based on the simulation of the 64-byte packet workload. Note that the figures in this section have different horizontal scales.
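The post-processing scripts themselves are not reproduced in this paper; the following C sketch shows how such a cumulative curve can be computed from a list of per-access latencies. The sample values are made up for illustration and merely stand in for the simulator's output.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    int main(void)
    {
        /* Made-up latency samples (in cycles). */
        int lat[] = { 43, 58, 75, 75, 90, 110, 140, 220 };
        int n = (int)(sizeof lat / sizeof lat[0]);

        qsort(lat, n, sizeof lat[0], cmp_int);
        for (int i = 0; i < n; i++)   /* cumulative percentage curve */
            printf("%3d cycles -> %5.1f%%\n", lat[i], 100.0 * (i + 1) / n);
        return 0;
    }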

Figure 5-4 depicts the cumulative distribution of the latencies in accessing SDRAM. The presented data covers only read operations from the referenced resource to the Microengines. In the graph, Microengines 0 to 3, each processing receive threads, trace almost the same curve, which means that the memory bandwidth is shared nearly equally among the four Microengines. The minimum latency for referencing SDRAM is 43 cycles, and 50% of the SDRAM accesses take at least 75 cycles to finish. Such a long latency would normally result in long processor stalls and degraded performance. In the worst case, the latency to access SDRAM can reach 220 cycles.

Figure 5-4. SDRAM Latency

Figures 5-5 and 5-6 show the cumulative distributions of the latencies in referencing SRAM. There are two ways to access the SRAM memory: unlocked (Figure 5-5) and locked (Figure 5-6). The SRAM controller maintains an 8-entry Content-Addressable Memory (CAM), which is used to protect an area of SRAM from being accessed by two or more processes (the StrongARM core and Microengine threads) at the same time. The Microengines access the read lock CAM by using the sram instruction. In the L2L3fwd16 program, every port has a transmit


queue associated with it. Each transmit queue has a queue descriptor, which contains head and tail pointers and a count of the packets in the queue. Since the queues are shared by receive and transmit threads, a thread first acquires a read lock prior to modifying the queue descriptor. The read_lock command locks the address and returns the contents of memory; the memory location is unlocked by using either the unlock command or the write_unlock command in the Microcode Assembler. In the graphs, Microengines 0 to 3 show data for the receive threads, and Microengines 4 and 5 show data for the transmit threads and schedulers.

The two graphs look very similar. The curves for Microengines 0 to 3 are similar in both graphs, and Microengines 4 and 5 trace almost the same curve as each other, though one different from that of Microengines 0 to 3. The SRAM latencies are much smaller than the SDRAM latencies. For unlocked accesses, the minimum latency is 16 cycles for Microengines 0 to 3 and 18 cycles for Microengines 4 and 5, and 50% of the accesses take at most 24 cycles for Microengines 0 to 3 and 21 cycles for Microengines 4 and 5 to complete. For the locked case, the minimum latency is 20 cycles for all Microengines, and about 50% of the SRAM accesses take at most 26 cycles for Microengines 0 to 3 and 22 cycles for Microengines 4 and 5 to complete. Unlocked access is thus somewhat faster than locked access. The maximum latency for accessing unlocked and locked SRAM memory is 204 and 251 cycles respectively. The IXP1200 also has other memory accesses that can incur large latencies; Appendix D illustrates the latencies in accessing the receive FIFO buffer, the scratchpad RAM, the FBI CSRs, and the hash unit, along with all the collected data.



Figure 5-5. SRAM Latency (unlocked)


Figure 5-6. SRAM Latency (locked)


5.4 Execution, Aborted, Stalled and Idle Ratio

As shown in Section 5.3, memory accesses incur very long latencies. In a multiprocessor system sharing a memory, such latency generally causes numerous CPU stall cycles, consumes execution time, and becomes a bottleneck for system performance. With its hardware multithreading support, however, the IXP1200 hides these long latencies deftly and works around the issue almost perfectly.

This section presents the distribution of executing, aborted, stalled, and idle cycles for each Microengine. The distributions for the four workloads are shown in Figures 5-7 to 5-10. Taken as a whole, the Microengines execute for approximately 60% to 75% of the cycles, otherwise taking aborted or idle cycles, thanks to the hardware multithreading. The stalled ratios are extremely low, almost zero. These graphs therefore demonstrate the Microengines' advantage in hiding latency. Since the same receive program is assigned to Microengines 0 to 3, their distributions are similar; the distributions of Microengines 4 and 5, which run the transmit threads, are alike for the same reason.

However, there seems to be room for technical improvement of the Microengine. The ratio of aborted cycles should not be overlooked. Although aborted cycles normally arise from branch and jump instructions, context switching also causes aborted cycles (one cycle of overhead per context switch), as described in Section 4.5. The aborted ratio is not trivial and in fact looks large: from 26.4% to 41.7% in Microengines 0 to 3 and from 23.3% to 25.3% in Microengines 4 and 5, even though its impact is still much smaller than the stall overhead in a typical multiprocessor system. In addition, as the packet size of the workload grows, the aborted ratio increases in Microengines 0 to 3. The reason is that the frequency of memory references for packet transfer goes up with larger packets, and the context switch overhead then affects the distribution. The development of hardware branch and context switch prediction or speculation techniques could effectively improve the performance of the IXP1200; alternatively, further optimization in the IXP1200 Assembler may be needed.

A multithreading example is presented in Appendix E. The illustration shows the history window of the Workbench GUI and indicates how the four threads operate in each of Microengines 0 to 2.

[Figure: stacked bar chart of the executing, aborted, stalled, and idle ratios for Microengines 0-5.]

Figure 5-7. Executing, Aborted, Stalled, and Idle ratio on 64bytes Workload


[Figure: stacked bar chart of the executing, aborted, stalled, and idle ratios for Microengines 0-5.]

Figure 5-8. Executing, Aborted, Stalled, and Idle ratio on 594bytes Workload

[Figure: stacked bar chart of the executing, aborted, stalled, and idle ratios for Microengines 0-5.]

Figure 5-9. Executing, Aborted, Stalled, and Idle ratio on 1518bytes Workload


[Figure: stacked bar chart of the executing, aborted, stalled, and idle ratios for Microengines 0-5.]

Figure 5-10. Executing, Aborted, Stalled, and Idle ratio on Mixture Workload

5.5 CPI (Cycles per Instruction)

Figure 5-11 presents the CPI of the six Microengines for the four workloads. The CPI is bounded below by 1, since each Microengine has a single pipeline with no out-of-order processing or speculation. Another finding is that the CPI increases as the size of the data packets increases. The reason is that when the packet size grows, more time must be spent transferring the packet to and from the various registers and memory locations in the IXP1200, whereas the overhead of header processing and route lookup stays constant. In addition, although the average packet size of the mixture workload is 406 bytes, less than the 594-byte workload, its CPI values in the receive threads are larger than those of the 594-byte workload. The reason could be that the mixture workload incurs more aborted cycles, as described in Section 5.4, which increases the CPI.
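As a rough cross-check, if one assumes that exactly one instruction completes per executing cycle, the CPI can be estimated as the reciprocal of the executing ratio reported in Section 5.4. The sketch below does this for Microengine 0, using ratios read from Figures 5-7 and 5-9; the assumption ignores stalled cycles, which Section 5.4 showed to be nearly zero.

    #include <stdio.h>

    int main(void)
    {
        /* Executing-cycle ratios for Microengine 0 (Figures 5-7 and 5-9). */
        const double executing_64B   = 0.691;  /* 64-byte workload   */
        const double executing_1518B = 0.578;  /* 1518-byte workload */

        /* CPI ~= total cycles / executing cycles = 1 / executing ratio. */
        printf("estimated CPI, 64B  : %.2f\n", 1.0 / executing_64B);    /* ~1.45 */
        printf("estimated CPI, 1518B: %.2f\n", 1.0 / executing_1518B);  /* ~1.73 */
        return 0;
    }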

[Figure: bar chart of CPI (0 to 2) for each Microengine (0-5) under the 64B, 594B, 1518B, and mixture workloads.]

Figure 5-11. CPI for Microengines

5.6 Throughput

This section reports throughput results for a router with 16 x 100 Mbps ports under the four workloads. First of all, the theoretical throughput of a fixed-size workload can be calculated by the following equations.


Packet arrival rate per port (pps) = 100 Mbps / ((12 bytes (IFG*1) + 8 bytes (Preamble/SFD*2) + packet size (bytes)) x 8 bits/byte)   (1)

Throughput per router (pps) = (packet arrival rate per port) x (number of ports)   (2)

Note: *1 IFG: Inter Frame Gap, *2 SFD: Start Frame Delimiter
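For concreteness, the short C program below evaluates equations (1) and (2) for the three fixed-size workloads; the constants mirror those used in the text.

    #include <stdio.h>

    int main(void)
    {
        const double line_rate = 100e6;        /* bits per second per port */
        const int ifg = 12, preamble_sfd = 8;  /* Ethernet overhead, bytes */
        const int ports = 16;
        const int sizes[] = { 64, 594, 1518 };

        for (int i = 0; i < 3; i++) {
            double bits_per_pkt = (ifg + preamble_sfd + sizes[i]) * 8.0;
            double per_port = line_rate / bits_per_pkt;   /* eq. (1), pps  */
            double per_router = per_port * ports / 1e6;   /* eq. (2), Mpps */
            printf("%4d bytes: %.4f Mpps per port, %.2f Mpps per router\n",
                   sizes[i], per_port / 1e6, per_router);
        }
        return 0;
    }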

If the router receives 64-byte packets at 100 Mbps on each port and forwards all of them to the output ports, the ideal throughput is estimated as 0.1488 Mpps (arrival rate per port) x 16 ports = 2.38 Mpps. The throughputs of the other workloads can be calculated likewise. In addition, this simulation assumes that a

packet is always ready at each port, with no idle time. In Figure 5-12, the simulation results are presented together with the ideal simulation rates and the theoretical OC-24 pps (packets per second) values for the four workloads. Since the physical bandwidth of OC-24 (1.24 Gbps) is close to the simulation bandwidth (16 x 100 Mbps = 1.6 Gbps), I put the theoretical value in the graph. As seen, there is no difference between the simulated and ideal throughput rates for the three fixed-size workloads. The mixture workload shows a slight gap. Although I used the average packet size of 406 bytes in equations (1) and (2) to calculate the ideal throughput of the mixture workload, this cannot be precise for a mixture, because the three packet sizes are sent randomly; alternatively, the mixture workload may simply place a heavier load on the IXP1200 than a fixed-size workload.


[Figure: bar chart of throughput in Mpps for the mixture, 1518-byte, 594-byte, and 64-byte workloads, comparing the simulated rate, the ideal simulation rate, and the theoretical OC-24 (CRC16) rate.]

Figure 5-12. Throughputs (bounded)

What we also see from the graph is that the theoretical throughput of OC-24 is higher than the simulated value for the 64-byte workload even though the physical bandwidth is lower. The reason is that the protocol overhead of Ethernet (38 bytes) is much larger than that of OC-24 POS (Packet over SONET) (7 bytes): for 46-byte IP packets, 38 bytes is an 82.6% overhead, while 7 bytes represents a 15.2% overhead. In general, then, the real throughput of IP packets depends not only on the router's processing ability but also on the media and protocol overhead, because different media and protocols such as Ethernet, SONET, and ATM have different overhead sizes. Appendix F explains the theoretical throughputs of IP packets under different encapsulations.

Regarding NP performance evaluation, the simulation results cannot be compared with such values directly. However, we can say that the processing ability of the IXP1200 should be sufficient to forward packets at OC-24 level speed: if 64-byte Ethernet-formatted packets were carried at OC-24 class bandwidth, the throughput would be 1.85 Mpps, which is lower than the simulation result.

To determine how fast a packet stream the IXP1200 can process, I simulated the four workloads under unbounded execution. The IXP1200 simulator offers two execution conditions, "bounded" and "unbounded". Bounded execution assumes a real system environment and was used to collect the data in Figure 5-12: data is received from and transmitted to the network at the specified data rate with an inter frame gap (IFG), so even if the processing capability of the IXP1200 exceeds the wire speed, the throughput converges to the wire speed.

Unbounded execution, on the other hand, evaluates the maximum packet processing capability of the IXP1200 at effectively infinite wire speed. The simulator always has data ready to be received by the IXP1200, without an IFG, and always has the ports ready to receive data from the IXP1200. This makes the simulation act as if data were coming from and going to the network at infinite speed, bypassing the receive and transmit buffers in the MAC devices. The throughputs under unbounded execution are shown in Figure 5-13. The graph includes theoretical throughputs based on the assumption that Ethernet packets flow at 1.244 Gbps (OC-24 class) and 2.488 Gbps (OC-48 class) bandwidth with the IFG ignored, although this is not a real protocol. As a consequence, the IXP1200 achieved from 70% to 88% of the theoretical OC-48 class throughput. Even though the IXP1200 is not sufficient to forward OC-48 packets at wire rate, the simulation results imply that the processing ability of a single NP is clearly approaching the OC-48 class.
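The OC-48 class reference values in Figure 5-13 can be reproduced with the same kind of arithmetic, as in this sketch (preamble/SFD still counted, IFG omitted, per the note under the figure):

    #include <stdio.h>

    int main(void)
    {
        const double oc48 = 2.488e9;            /* bits per second */
        const int sizes[] = { 64, 594, 1518 };

        for (int i = 0; i < 3; i++) {
            double mpps = oc48 / ((8 + sizes[i]) * 8.0) / 1e6;
            printf("%4d bytes: %.2f Mpps\n", sizes[i], mpps);  /* 4.32, 0.52, 0.20 */
        }
        return 0;
    }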


[Figure: bar chart of throughput in Mpps for the mixture, 1518-byte, 594-byte, and 64-byte workloads, comparing the simulated rate with theoretical 1.244 Gbps Ethernet (OC-24 class) and 2.488 Gbps Ethernet (OC-48 class) rates.]

Note: These throughputs do not include the 12-byte IFG overhead.

Figure 5-13. Throughputs (unbounded)


6. Other Network Processors

The microarchitecture of network processors is marked by an emphasis on streaming data throughput and heavy use of architectural parallelism. Chip multiprocessing with hardware multithreading appears to be a popular technique for exploiting the huge thread-level parallelism available in packet processing workloads. In reality, most NPs support fast context switching, as the IXP1200 does. For example, the IQ2000 has four 32-bit scalar cores with 64-bit memory interfaces, so each core can perform a double-word load or store in a single clock cycle [2]. Each core has five identical register files (32 registers, 32 bits wide, triple-ported), an arrangement that allows each core to run five concurrent threads of execution with fast context switching. In addition, Xstream Logic's network processor is based on the dynamic multistreaming (DMS) technique, also known as simultaneous multithreading [14]. The processor core can support eight threads, each with its own instruction queue and register file, and the core is divided into two clusters of four threads each. Every clock cycle, each cluster can issue up to 16 instructions, four from each thread; four of the 16 are selected and dispatched, each to one of the four functional units in that cluster, for execution. The DMS core has nine pipe stages and features a MIPS-like ISA.

This section characterizes other network processors, focusing especially on their instruction sets, context switching, and branching. Three popular network processors with different fast context switch features are introduced: Lexra's NetVortex [15], Motorola's C-Port [3], and IBM's PowerNP [4].


6.1 Lexra’s NetVortex

Lexra’s NetVortex is based on 32-bit MIPS-1 architecture and allow up to 8

contexts per processor. Each context includes 32 general registers r0-r31, its own

program counter called CXPC and a status register called CXSTATUS. The status

register allows the program to set the I/O and software events on which the thread

will wait and the priority of the thread. NetVortex actually can perform 18 extended

instructions presented in Table 6-1 in addition to general MIPS-1 instructions. Note

that MIPS-1 Instruction set is shown in Appendix G. In fact, 6 instructions support

context switching among different threads and hides numerous latency of memory

load and store. In addition, the instruction set includes some new bit-field

instructions that make it easier to analyze packet headers.

Figure 6-1 depicts the fast context switch mechanism of NetVortex. The LW.CSW instruction (load word with context switch) is the second instruction in thread 1. Like the MIPS architecture, NetVortex provides a delay slot after a branch or memory reference and allows one more instruction to execute while the processor is fetching data. NetVortex switches context to the next available thread at the delay slot. The context program counter CXPC then points to the fourth instruction, namely the instruction after the delay slot, and the wait status is set in CXSTATUS at that time. When thread 2 begins to run, its CXPC becomes the PC, the global program counter containing the currently running instruction, and its CXSTATUS is set to active. When thread 2 encounters the next context-switch instruction, the whole procedure repeats itself. When the memory reference completes, the CXSTATUS of the waiting thread changes from wait to ready, and the thread becomes available.

Unlike the conditional branches of MIPS-1, which the CPU must resolve in the execute stage, all context switching instructions execute unconditionally. The CPU discovers a context-switch instruction in the decode stage, as the IXP1200 does, and always executes the following instruction (in the delay slot), like the IXP1200's defer option, in order to avoid creating bubbles. As a result, programmer skill or compiler optimization is very important for filling the delay slot and avoiding context switch penalty cycles, just as on the IXP1200. In NetVortex, the CPU stalls the pipeline only if no other thread is ready to resume execution.

Table 6-1. NetVortex extended Instruction set

Context-Control Instructions:
  MYCX       Read my context
  POSTCX     Post event to a context
  CSW        Context switch
  LW.CSW     Load word with context switch
  LT.CSW     Load twinword* with context switch
  WD         Write descriptor to device
  WD.CSW     Write descriptor to device with context switch
  WDLW.CSW   Write descriptor to device, load word with context switch
  WDLT.CSW   Write descriptor to device, load twinword with context switch

Bit-Field Instructions:
  SETI       Set subfield to ones
  CLRI       Clear subfield to zeroes
  EXTIV      Extract subfield and prepare for insertion
  INSV       Insert extracted subfield
  ACS2       Dual 16-bit ones complement add for checksum

Cross-Context Access Instructions:
  MFCXG      Move from a context general-purpose register
  MTCXG      Move to a context general-purpose register
  MFCXC      Move from a context-control register
  MTCXC      Move to a context-control register

Note: *Twinwords are 64-bit values.


[Figure: two thread contexts, each with its own general purpose register file (r0-r31) and context registers. Thread 1's program executes I2: LW.CSW (reg, addr) followed by the delay-slot instruction I3; at the context switch, thread 1's CXPC is set to I4 and its CXSTATUS to Wait, while thread 2 becomes active with PC = I1(T2).]

Figure 6-1. NetVortex Context Switch Mechanism

6.2 Motorola's C-5

Motorola's C-5 has 16 dedicated, programmable Channel Processors (CPs) for packet forwarding. Each CP consists of a Serial Data Processor (SDP), which contains microcode-programmable components for receive and transmit processing, and a Channel Processor RISC Core (CPRC), which performs packet processing via special-purpose instruction and data memory. The CPRC supports scheduling and characterizing packets, table lookup, and making forwarding and filtering decisions. The CPRC implements a subset of the MIPS-1 instruction set (excluding multiply, divide, floating point, and Coprocessor Zero (CP0) instructions); refer to Appendix G. Even though the standard MIPS CP0 instructions are not supported, the C-5 provides its own special-purpose Coprocessor Zero registers, shown in Table 6-2.


The CPRC instructions can be classified into five groups: 1) load and store, 2) arithmetic and logical, 3) jump and branch, 4) coprocessor interface, and 5) miscellaneous.

To multiplex processing among a number of different tasks, the CPRC incorporates four sets of 32 internal registers and performs a context switch under software control or a hardware interrupt. Therefore, the C-5 provides 16 x 4 = 64 threads in total for packet processing. Execution resumes on a different context within two cycles.

As described, context switching on the C-5 is triggered in one of two ways: software control or a hardware interrupt. In the software mechanism, context switching is executed by the C-5's own Coprocessor Zero instructions. For example, MTC0 $1 $3 (where $1 specifies the destination context and $3 is the source, or current, context) switches from context $3 to context $1. These contexts have no priority.

In the hardware mechanism, the CPRC uses prioritized hardware interrupts, which can be triggered by any bits in two event registers. Hardware interrupts employ a special-purpose register (K1) containing the program counter value and the context number of the interrupted context. First, all interrupts are disabled until a restore-from-exception (RFE) instruction is executed. In the interrupted context, the address of the next instruction to execute is saved in K1. When RFE is executed, the program flow returns to the previously interrupted context. Thus, context switching is performed.
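A schematic C model of the two mechanisms may make the flow clearer (all types and names here are hypothetical; the real behavior is implemented by CPRC hardware and its Coprocessor Zero registers):

#define NUM_CONTEXTS 4                 /* each CPRC holds four register sets */

struct cp0_state {
    unsigned current_ctx;              /* R3: current context (two LSBs)     */
    unsigned k1;                       /* K1: saved PC and interrupted ctx   */
    int      interrupts_enabled;
};

/* Software switch, in the spirit of MTC0 $1 $3: the running context simply
 * names its successor; contexts have no priority. */
static void mtc0_switch(struct cp0_state *cp, unsigned dst_ctx)
{
    cp->current_ctx = dst_ctx & 0x3;
}

/* Hardware switch on a prioritized interrupt: save the next PC and the
 * interrupted context number in K1, then mask interrupts until RFE. */
static void hw_interrupt(struct cp0_state *cp, unsigned next_pc)
{
    cp->k1 = (next_pc << 2) | cp->current_ctx;  /* packing is illustrative */
    cp->interrupts_enabled = 0;
}

/* RFE: re-enable interrupts and resume the previously interrupted context. */
static void rfe(struct cp0_state *cp)
{
    cp->interrupts_enabled = 1;
    cp->current_ctx = cp->k1 & 0x3;
}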


Table 6-2. C-5 Coprocessor Zero Register Definitions

Register Definition
R0 Whoami Register – Contains the DMEM base (hardcoded) for this CPRC
R1 Interrupt Table Register – Contains the vector address for INT 0
R2 Break Table Register – Contains the vector address for break 0
R3 Current Context Register – The two LSBs are the current context register
R4 DMEM Comparison Address – Contains the address at which a debug pulse is generated
R5 DMEM Comparison Address Mask – Contains the mask for the DMEM address
R6 DMEM Comparison Data – Contains the data value for which a debug pulse is generated
R7 DMEM Comparison Data Mask – Contains the mask for the DMEM data
R8 Interrupt Flag – The LSB is the Interrupt Flag
R9 Read/Write Mask – The two LSBs are the Read mask and the Write mask for R4 to R7


6.3 IBM's PowerNP

IBM's PowerNP integrates 16 32-bit picoprocessors and 1 PowerPC core on a single chip. Each picoprocessor supports 2 hardware threads, and each thread has 16 32-bit (or 32 16-bit) General Purpose Registers (GPRs). Two picoprocessors are packed into a dyadic protocol processor unit (DPPU) and share eight coprocessors, such as a tree search engine, semaphore, checksum, data store, and so on. Four threads perform context switching within a cluster.

Each picoprocessor has a one-cycle ALU, shared by its two threads, and performs packet processing through its core instruction set of operation codes (opcodes). The opcodes fall into four categories: 1) ALU opcodes, 2) control opcodes, 3) data movement opcodes, and 4) coprocessor execution opcodes. ALU opcodes are further categorized into six types: 1) arithmetic immediate, 2) logical immediate, 3) compare immediate, 4) load immediate, 5) arithmetic/logical register, and 6) count leading zeros. For the conditional branch operations among the control opcodes, the operation depends on condition codes. All opcodes and condition codes are presented in Appendix G.

Context switching occurs when the picoprocessor is waiting for a shared resource (for example, waiting for one of the coprocessors to complete an operation, return the results of a search, or access DRAM). The context switching is handled by coprocessor execution opcodes. Figure 6-2 shows the pseudo code of the wait opcode as an example of a coprocessor execution opcode. The wait opcode synchronizes one or more coprocessors. The mask16 field is a bit mask (one bit per coprocessor) in which the bit number corresponds to the coprocessor number. The thread stalls until all coprocessors indicated by the mask complete their operations. Priority can also be released with this command. Context switching incurs no overhead between threads. Since the eight coprocessors perform the primary router processing and reduce the path length of the picoprocessor, this should be a big advantage for parallel processing.

IF Reduction_OR(mask16(i) AND coprocessor.Busy(i)) THEN
    PC <= stall
ELSE
    PC <= PC + 1
END IF
IF p = 1 THEN
    PriorityOwner(other thread) <= TRUE
ELSE
    PriorityOwner(other thread) <= PriorityOwner(other thread)
END IF

Figure 6-2. Coprocessor Execution Opcode Example (Wait Opcode)
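In C terms, the stall condition of Figure 6-2 is simply a masked busy test; the sketch below illustrates that reduction-OR and is not IBM code (the busy bits would come from the coprocessor hardware):

#include <stdint.h>
#include <stdbool.h>

/* One busy bit per coprocessor; bit i corresponds to coprocessor i,
 * exactly as in the mask16 field of the wait opcode. */
static bool wait_must_stall(uint16_t mask16, uint16_t coproc_busy)
{
    /* Reduction-OR of mask16(i) AND Busy(i): stall while any masked
     * coprocessor has not yet completed its operation. */
    return (mask16 & coproc_busy) != 0;
}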


7. Conclusions and Future Work

This paper has addressed the characterization of router functions and workloads for the evaluation of Network Processors (NPs). Based on the four proposed workloads (64 bytes, 594 bytes, 1518 bytes, and a mixture), the analytical data from the IXP1200 has revealed some critical features of the microarchitecture associated with router processing and its performance. It has shown that "Arithmetic, Rotate, and Shift" and "Branch and Jump" instructions account for a high proportion of the instruction mix. Although the ratios do not differ greatly across the different packet-size workloads, the simulation results show that the proportions depend slightly on them. In the receive thread in particular, as the packet size increases, the ratios of "Arithmetic, Rotate, and Shift" and "Local Register" instructions increase because of the frequency of packet header processing. In addition, the simulation has shown that the IXP1200 almost completely hides the huge latencies of memory reference instructions with fast context switching. However, another critical issue has come up: the number of cycles aborted by branches and context switches is not small, and it cannot be ignored. To reduce those cycles effectively and improve performance, some dynamic hardware prediction or speculation could become necessary for future NPs, or alternatively better assembler and compiler optimization. Since NPs generally include a number of RISC cores and other network processing components, it could be expensive to apply such techniques. Therefore, if a prediction or speculation technique is considered, it would be necessary to use as small a prediction buffer or history table as possible. Thus, there seems to be room for


technological improvement of NPs’ hardware context switching and branching.

In addition, this paper demonstrated that CPI increases as the size of the data packets increases, particularly in the receive thread, because a large packet needs more time to transfer even though the overhead for header and lookup processing stays constant. Moreover, the mixture workload appears to be dominated by the 1518-byte packets: its results are always close to those of that workload, even though its average packet size is less than 594 bytes.

In the bounded throughput evaluation, the IXP1200 succeeded in achieving the ideal throughput of 2.38 Mpps for minimum-sized packets. In the unbounded evaluation, the IXP1200 achieved 3.07 Mpps, which constitutes approximately 71% of the theoretical throughput for Ethernet carried on an OC-48 class physical line (see the consistency check below). In conclusion, a single NP has enough processing capability for OC-24 but still not enough to sustain OC-48 at wire speed. However, since its capability is clearly approaching that level, a single NP should accomplish OC-48 and more in the near future.
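The 71% figure can be checked for consistency. Assuming the theoretical rate counts each minimum-sized 64-byte Ethernet frame together with its 8-byte preamble (an assumption made here for illustration; the exact framing overhead used is not spelled out above), the numbers agree:

$$ R_{\mathrm{theoretical}} = \frac{2.488\ \text{Gbps}}{(64+8)\ \text{bytes} \times 8\ \text{bits/byte}} \approx 4.32\ \text{Mpps}, \qquad \frac{3.07\ \text{Mpps}}{4.32\ \text{Mpps}} \approx 0.71 $$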




Appendix A: Pseudo Code

//**************************** Format of RCV_RDY_LO **************************
// 31:16  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
//   rr   rr rr rr rr rr rr rr rr rr rr rr rr rr rr rr rr
// rr: Receive Ready Flags corresponding to each port
//****************************************************************************

receive_ready_check()

{

check_port:

recrdy_inflight_blocked:

if (@recrdy_inflight == SEMAPHORE_OPEN)

goto set_rec_ready

else

goto recrdy_inflight_blocked

end if

set_rec_ready:

@recrdy_inflight = SEMAPHORE_CLOSE

copy $rec_rdy <- CSR_RCV_RDY_LO

@recrdy_inflight = SEMAPHORE_OPEN

if (0 != (1 & ($rec_rdy >> (rec_req & 0x1F)))) // shift by the port number (lower 5 bits of rec_req)

goto receive_request

else

goto check_port

end if

receive_request:

return

}

Figure A-1. Receive Ready Check
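The @recrdy_inflight flag above acts as a binary semaphore guarding the CSR read. A minimal C analogue, with a C11 atomic flag standing in for the IXP1200's shared register (names invented for illustration), would be:

#include <stdatomic.h>
#include <stdint.h>

static atomic_flag recrdy_inflight = ATOMIC_FLAG_INIT;   /* clear = open  */
static volatile uint32_t CSR_RCV_RDY_LO;                 /* stand-in CSR  */

static uint32_t read_rcv_rdy_lo(void)
{
    /* Spin until the semaphore is open, closing it atomically
     * (test-and-set), as in recrdy_inflight_blocked above. */
    while (atomic_flag_test_and_set(&recrdy_inflight))
        ;
    uint32_t rec_rdy = CSR_RCV_RDY_LO;       /* copy $rec_rdy <- CSR */
    atomic_flag_clear(&recrdy_inflight);     /* reopen the semaphore */
    return rec_rdy;
}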

//**************************** Format of RCV_REQ **************************************
// 31  30:29 28:27  26  25:22 21:18 17:16  15  14  13   12   11    10:6 5:3 2:0
// RES  FA   TSMG   SL   E2    E1    FS    NFE RES IGFR RES  SIGRS TID  RM  RP
// RES: Reserved, FA: Maximum IX Bus Accesses, TSMG: Thread Message, SL: Status Length,
// E2: Element2, E1: Element1, FS: Fast/Slow port mode, NFE: Number of FIFO Elements,
// IGFR: Ignore Fast Ready Flag, SIGRS: Signal Receive Scheduler, TID: Thread ID,
// RM: Receive MAC, RP: Receive Port
//*************************************************************************************

receive_request()

{

port_rx_init(rec_req, rfifo_addr) // create canned receive_request

req_inflight_check:

if (@req_inflight == SEMAPHORE_OPEN)

goto set_rec_req

else

goto req_inflight_check

end if

set_rec_req:

@req_inflight = SEMAPHORE_CLOSE

$rec_csr = rec_req

copy $rec_csr -> CSR_RCV_REQ

return

}

Figure A-2. Receive Request Issue

Page 86: Workload Characterization and Performance for a Network ...palms.ee.princeton.edu/PALMSopen/miyazaki02workload.pdf · time-to-market frames. Network processors (NPs) are now very

78

//**************************** Format of RCV_CTL *********************************

THMSG: Thread Message, MACPORTTHD: MAC Port Number/Header Thread ID,

SOPSEQ: Start of Packet Sequence Number, RF: Receive Fail, RERR: Receive Error,

SE: Second Element, FE: First Element, EF: Element filled, SN: Sequence Number,

VLDBytes: Valid Bytes, EOP: End of Packet, SOP: Start of Packet

//********************************************************************************

receive_status()

{

signal_receive:

wait_start_receive()

copy $rec_csr <- CSR_RCV_CNTL

if ( ($rec_csr & 1) >0)

goto sop

else

exception = 0x3 & ($rec_csr >>18) // save RF and RERR (RCV_CNTL[19:18])

rec_state = (rec_state, byte_enable(1101)) + (($rec_csr << 8), byte_enable(0010))

// save VLDBytes, EOP, SOP (RCV_CNTL[7:0])

goto done

end if

sop:

rec_state = 0 + (($rec_csr << 8), byte_enable(0010)) // initialize VLDBytes, EOP, SOP (RCV_CNTL[7:0])

done:

return

}

Figure A-3. Receive Packet Status Acquisition

Page 87: Workload Characterization and Performance for a Network ...palms.ee.princeton.edu/PALMSopen/miyazaki02workload.pdf · time-to-market frames. Network processors (NPs) are now very

79

pkbuf_allocate()

{

xbuf_alloc($pop_xfer, 1 lword)

$pop_xfer [0] = buf_pop(PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)

while ($pop_xfer [0] == SRAM_DESC_BASE)

$pop_xfer [0] = buf_pop(PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)

end while

pkbuf_addr = cal_pkbuf_addr($pop_xfer [0],PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)

desc_addr = $pop_xfer

xbuf_free($pop_xfer)

}

Figure A-4. Packet Buffer Allocation

port_rx_fail_error_check()

{

if (exception == PORT_RXFAIL)

inc_rx_fail_count_and_total_discard()

// increment exception counter and behave like EOP, except that the packet is not queued

pkbuf_addr = cal_pkbuf_addr(desc_addr,PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)

@req_inflight = SEMAPHORE_OPEN

continue

else if (exception == PORT_RXERROR)

inc_rx_error_count() // increment exception counter

pkbuf_addr = cal_pkbuf_addr(desc_addr,PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)

@req_inflight = SEMAPHORE_OPEN

continue

end if

@req_inflight = SEMAPHORE_OPEN

return

}

Figure A-5. Port Fail/Error Check


//********************* Format of Ethernet/802.3 ***************
// Long word 0:           Destination Address [0:31]
// Long word 1:           Destination Address [32:47], Source Address [0:15]
// Long word 2:           Source Address [16:47]
// Long word 3:           EtherType/Length [0:15], Data*/LLC
// Long words 4-14 (min): Data*/LLC
// Long word 15 (min):    FCS [0:31]
// *Note: min 46 bytes and max 1500 bytes of data in Ethernet
//***************************************************************

get_mpkt_header()

{

xbuf_alloc($pkt_buf, 4 lwords) // $pkt_buf[0]-[3] for 16bytes header

xbuf_alloc($pkt_buf_eth , 2 lwords) // for SNAP header

xbuf_link($pkt_buf, $pkt_buf_eth )

copy $pkt_buf <- RFIFO(addr(rfifo_addr + QWOFFSET0), size(3quadwords))

#if little endian

sa01 = $pkt_buf [1] >> 16

#else

sa01 = 0 + $pkt_buf[1](LS16bit) // for later merge

#end if

extract proto_len <- $pkt_buf [3](addr(BYTEOFFSET0 + 12), size(2bytes))

return

}

Figure A-6. MAC Packet Header Acquisition



parse_packet()

{

ethertype = 0

if proto_len <= 1500 // 802.3 (length)

extract eth_llc1 <- $pkt_buf (addr(BYTEOFFSET0 + 14), size(1byte))

extract eth_llc2 <- $pkt_buf (addr(BYTEOFFSET0 + 15), size(1byte))

eth_llc1 = eth_llc1 & eth_llc2

extract eth_llc2 <- $pkt_buf (addr(BYTEOFFSET0 + 16), size(1byte))

if eth_llc1 == 0xAA

if eth_llc2 == 0x03

extract ethertype <- $pkt_buf (addr(BYTEOFFSET0 + 17), size(3bytes))

end if

end if

if ethertype > 0

pkstate = pkstate | (TRUE << SHIFT_PKTLINKTYPE_LLCSNAP)

else

pkstate = pkstate | (TRUE << SHIFT_PKTLINKTYPE_LLC)

end if

else // Ethernet(type)

pkstate = pkstate | (TRUE << SHIFT_PKTLINKTYPE_ETHERNET)

ethertype = proto_len

end if

xbuf_free($pkt_buf_eth )

return

}

Figure A-7. Parse Packet


ethertype_classifier()

{

if (ethertype == 0x0800) // Internet Protocol(IP)

IP_forwarder ()

else if (ethertype == 0x0805) // X.25

X25_forwarder ()

else if (ethertype == 0x0806) // Address Resolution Protocol(ARP)

ARP_forwarder ()

else if (ethertype == 0x8137) // IPX

IPX_forwarder ()

else if (ethertype == 0x809B) // Appletalk over Ethernet

Appletalk_forwarder ()

end if

return

}

Note: This code is not included in L2L3fwd16

Figure A-8. Ethertype Field Classifier

ether_filter(ethertype,$pkt_buf)

{

pkt_state = 0, pkaction = 0

rec_port_num = rec_req & 0x1F

ether_port_info(rec_port_num )

xbuf_alloc($$dxfer, 8)

xbuf_link($$dxfer, $$dxfer)

xbuf_alloc($hash_buf, 4 lwords)

//in_port_filter_type - L2 filtering options

// 00 Ethertype based filtering

// 01 Explicit rule, action specified in SDRAM fwd entry
// 10 Positive filtering, action implied by presence/absence of filter rule
// 11 Negative filtering, action implied by presence/absence of filter rule

// setup for da hash

extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 0), size(2bytes)) // load DA bytes 0-1

$hash_buf [0] = ethfilt_tempa


extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 2), size(4bytes)) // load DA bytes 2-5

$hash_buf [1] = ethfilt_tempa

extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 6), size(2bytes)) // load SA bytes 0-1

$hash_buf [2] = ethfilt_tempa

extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 8), size(4bytes)) // load SA bytes 2-5

$hash_buf [3] = ethfilt_tempa

hash2_48($hash_buf) // two 48-bit Hash operation for DA and SA

hash0 = $hash_buf [0]

hash1 = $hash_buf [1]

hash2 = $hash_buf [2]

hash3 = $hash_buf [3]

// Hash Table Level 1 lookup

ethfilt_tempb = SRAM_L1_ADDR_HASH_BASE

ethfilt_tempa = 0 + (hash1, byte_enable(0011)) // table index

copy $hash_buf [0 ] <- SRAM(addr(ethfilt_tempa + ethfilt_tempb), size(1longword)) // lookup DA

ethfilt_tempa = 0 + (hash3, byte_enable(0011)) // table index

copy $hash_buf [1] <- SRAM(addr(ethfilt_tempa + ethfilt_tempb), size(1longword))// lookup SA

da_lookup_result = $hash_buf [0] // check results of DA lookup

sa_lookup_result = $hash_buf [1] // check results of SA lookup

xbuf_free($hash_buf)

// Hash Table Level2 lookup

ethfilt_tempb = 0x1 & (da_lookup_result >> 31) // Check for collision bit

if (ethfilt_tempb == 1)

da_lookup_result = hash_resolve(hash0,hash1,SRAM_L2_ADDR_HASH_BASE) // get da_lookup_result on L2 lookup

end if

ethfilt_tempb = 0x1 & (sa_lookup_result >> 31) // Check for collision bit

if (ethfilt_tempb == 1)

sa_lookup_result = hash_resolve(hash2,hash3,SRAM_L2_ADDR_HASH_BASE) // get sa_lookup_result on L2 lookup

end if

if (da_lookup_result == 0) // MAC entry does not exist

da_port_num = DEST_PORT_NO_MATCH

else


ethfilt_tempb = 0xfffffff & da_lookup_result // SDRAM index

da_lookup_result = ethfilt_tempb

copy $$dxfer <- SDRAM(addr(0 + da_lookup_result), 2 quadwords) // read the forwarding table for DA

ethfilt_tempa = 0

ethfilt_tempa = FORWARD_ENTRY_MASK & ($$dxfer >> FORWARD_ENTRY_SHIFT_SIZE)

if (ethfilt_tempa) // If forwarding information is associated with this entry

da_port_num = $$dxfer[0] & 0x1F // isolate DA port number from forwarding table
extract dst_port_entry <- ether_port_info(da_port_num) // get port information of the destination port

end if

end if

src_port_ethertype = 0 + ((src_port_entry >> 8), byte_enable(0011))

dst_port_ethertype = 0 + ((dst_port_entry >> 8), byte_enable(0011))

src_port_filtertype = 0x03 & (src_port_entry >> 4)

dst_port_filtertype = 0x03 & (dst_port_entry >> 4)

pkaction = PKT_PERMIT // Default action

if (((src_port_filtertype == FILT_TYPE_ETHERTYPE) || (dst_port_filtertype == FILT_TYPE_ETHERTYPE)) &&

(da_port_num != DEST_PORT_NO_MATCH))

ethfilt_tempa = ethertype & src_port_ethertype

if (ethfilt_tempa != dst_port_ethertype)

pkaction = PKT_DENY

goto filter_return

end if

end if

if ((src_port_filtertype == FILT_TYPE_EXPLICT) && sa_lookup_result) // SA filter

ethfilt_tempb = 0xfffffff & sa_lookup_result // SDRAM index

sa_lookup_result = ethfilt_tempb

copy $$dxfer <- SDRAM(addr(0 + sa_lookup_result), size(2 quadwords)) // read the forwarding table for SA (4 longwords)

pkaction = BR_FILTER_ACTION_MASK & ($$dxfer >> FILTER_ACTION_SA_SHIFT_SIZE)

if (pkaction == PKT_DENY)

goto filter_return

end if

end if

if ((dst_port_filtertype == FILT_TYPE_EXPLICT) && da_lookup_result) // DA filter


pkaction = BR_FILTER_ACTION_MASK & ($$dxfer >> FILTER_ACTION_DA_SHIFT_SIZE)

if (pkaction == PKT_DENY)

goto filter_return

end if

end if

if ((src_port_filtertype == FILT_TYPE_POSITIVE) || (dst_port_filtertype == FILT_TYPE_POSITIVE))

// Positive filtering - default action is to permit. SA/DA entry in the table will be dropped.

//DA filter

if (da_lookup_result)

pkaction = PKT_DENY

goto filter_return

end if

// SA filter

if (sa_lookup_result)

pkaction = PKT_DENY

goto filter_return

end if

end if

if ((src_port_filtertype == FILT_TYPE_NEGATIVE) || (dst_port_filtertype == FILT_TYPE_NEGATIVE))

// Negative filtering - default action is to deny. SA/DA entry in the table will be allowed.

// DA filter

if (!da_lookup_result)

pkaction = PKT_DENY

goto filter_return

end if

// SA filter

if (!sa_lookup_result)

pkaction = PKT_DENY

goto filter_return

end if

end if

filter_return:

}

Note: This filter pseudo code includes Layer2 MAC Protocol filtering and/or Bridging


Figure A-9. Filter
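The two-level hash lookup at the heart of ether_filter condenses to a few lines of C (a sketch under assumed data layouts, not the IXP1200 reference code): the level-1 table is indexed directly by the hashed address, bit 31 flags a collision, and a collision sends the search to level-2 resolution.

#include <stdint.h>

#define COLLISION_BIT    0x80000000u
#define SDRAM_INDEX_MASK 0x0FFFFFFFu

extern uint32_t l1_table[];                             /* SRAM L1 hash table */
extern uint32_t hash_resolve(uint32_t lo, uint32_t hi); /* L2 chain search    */

/* Return the SDRAM forwarding-entry index for a hashed 48-bit MAC address,
 * or 0 when no entry exists (hypothetical layout: the low 16 bits of the
 * high hash word index the L1 table). */
static uint32_t mac_lookup(uint32_t hash_lo, uint32_t hash_hi)
{
    uint32_t result = l1_table[hash_hi & 0xFFFF];       /* direct L1 lookup    */
    if (result & COLLISION_BIT)                         /* collision bit set:  */
        result = hash_resolve(hash_lo, hash_hi);        /* resolve on L2       */
    return result & SDRAM_INDEX_MASK;                   /* isolate SDRAM index */
}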

//********************* Format of port table entry *******************
// Long word 0: unused [31:24], EtherType [23:8], unused [7:6], filter type [5:4], port state [3:0]
// Long word 1: MAC Address (bytes 0-3)
// Long word 2: MAC Address (bytes 4-5) [15:0]
//**********************************************************************

ether_port_info(in_port_no)

{

// save port MAC address

xbuf_alloc($port_inf, 3)

ethport_tempa = in_port_no << BR_PORT_ENTRY_MULTIPLER

ethport_tempb = SRAM_PORT_STATE_BASE

copy $port_inf[0] <- SRAM(addr(ethport_tempb + ethport_tempa), size(3lwords))

out_port_entry = $port_inf [0]

extract out_port_mac_addr32 <- $port_inf (addr(BYTEOFFSET0 + 4), size(4bytes))
extract out_port_mac_addr16 <- $port_inf (addr(BYTEOFFSET0 + 10), size(2bytes))

xbuf_free($port_inf)

return

}

Figure A-10. Port information Acquisition for Filter



//************************* Format of IPv4 header ************************
// Long word 0: Version [0:3], HLen [4:7], TOS [8:15], Length [16:31]
// Long word 1: Ident [0:15], Flags [16:18], Offset [19:31]
// Long word 2: TTL [0:7], Protocol [8:15], Checksum [16:31]
// Long word 3: Source Address [0:31]
// Long word 4: Destination Address [0:31]
// Long word 5: Options (variable), Pad (variable)
//***********************************************************************

get_IP_header()

{

xbuf_alloc($pkt_buf_ip , 4 lwords)

xbuf_link($pkt_buf, $pkt_buf_ip )

xbuf_link($pkt_buf_ip , $pkt_buf)

copy $pkt_buf_ip[0] <- RFIFO(addr(rfifo_addr + QWOFFSET2), size(3quadwords))

return

}

Figure A-11. IP Header Acquisition

IP_version_check()

{

if (bit(pkstate, SHIFT_PKTLINKTYPE_ETHERNET) == TRUE)
extract ip_verslen <- $pkt_buf (addr(BYTEOFFSET14), size(1byte))
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLC) == TRUE)
extract ip_verslen <- $pkt_buf (addr(BYTEOFFSET17), size(1byte))
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)
extract ip_verslen <- $pkt_buf (addr(BYTEOFFSET22), size(1byte))
// Save SA 2-5 as the packet wraps and overwrites pkt_buf0-1
extract tempa <- $pkt_buf (addr(BYTEOFFSET0 + 8), size(4bytes)) // save SA bytes 2-5 as the following rfifo_read overwrites
extract tempb <- $pkt_buf (addr(BYTEOFFSET0 + 12), size(4bytes)) // save len/ssap/dsap as the following rfifo_read overwrites


extract tempc <- $pkt_buf (addr(BYTEOFFSET0 + 16), size(4bytes)) // save CTL/OUI as the following rfifo_read overwrites

copy $pkt_buf[2] <- RFIFO(addr(rfifo_addr + QWOFFSET5), size(1quadword))

end if

return

}

Figure A-12. IP Version Check

xferpayload_&_iphdrchck_&_modify()

{

if (bit(pkstate, SHIFT_PKTLINKTYPE_ETHERNET) == TRUE)

copy RFIFO(addr(rfifo_addr + QWOFFSET4), size(4quadwords)) -> DRAM(addr(pkbuf_addr + QWOFFSET4))

exception = ip_verify($pkt_buf, BYTEOFFSET14)

ip_modify($$dxfer, BYTEOFFSET14, $pkt_buf, BYTEOFFSET14)

extract ip_dest <- $pkt_buf (addr(BYTEOFFSET14 + 16), size(4bytes))

else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLC) == TRUE)

copy RFIFO(addr(rfifo_addr + QWOFFSET4), size(4quadwords)) -> DRAM(addr(pkbuf_addr + QWOFFSET4))

exception = ip_verify($pkt_buf, BYTEOFFSET17)

ip_modify($$dxfer, BYTEOFFSET17, $pkt_buf, BYTEOFFSET17)

$$dxfer[3] = $pkt_buf [3]

extract ip_dest <- $pkt_buf (addr(BYTEOFFSET17 + 16), size(4bytes))

else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)

copy RFIFO(addr(rfifo_addr + QWOFFSET5), size(3quadwords)) -> DRAM(addr(pkbuf_addr + QWOFFSET5))

exception = ip_verify($pkt_buf, BYTEOFFSET22)

ip_modify($$dxfer, BYTEOFFSET22, $pkt_buf, BYTEOFFSET22)

extract ip_dest <- $pkt_buf (addr(BYTEOFFSET22 +16), size(4bytes))

end if

xbuf_free($pkt_buf_ip ) // release $pkt_buf_ip assigned by get_IP_header

return

}

Figure A-13. IP Header Check & Modify


ip_verify($pkt_buf, BYTEOFFSET)

{

total_len_verify:

extract total_len <- $pkt_buf (addr(BYTEOFFSET+2), size(2bytes))

if ((total_len - 0x14) >= 0) // at least 20 bytes (= IP header length)
goto ttl_verify
else
exception = IP_BAD_TOTAL_LENGTH

goto end

end if

ttl_verify:

extract ttl <- $pkt_buf (addr(BYTEOFFSET+8), size(1byte))

if (ttl > 0) // at least 1

exception = 0

goto cksum_verify

else

exception = IP_BAD_TTL

goto end

end if

cksum_verify:

exception = ip_cksum_verify($pkt_buf, addr(BYTEOFFSET+10), size(2bytes))

if (exception == 0)

goto end

else

exception = IP_BAD_CHECKSUM

end if

end:

return(exception)

}

Figure A-14. IP verify
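ip_cksum_verify is not expanded above; a standard C implementation of the check it presumably performs (the RFC 1071 ones-complement sum over the header, which for a valid header folds to 0xFFFF) is:

#include <stdint.h>
#include <stddef.h>

/* Ones-complement sum of the IP header taken 16 bits at a time; with the
 * checksum field included, a valid header sums to 0xFFFF.
 * Returns 0 on success, nonzero on a bad checksum. */
static int ip_cksum_verify(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < hdr_len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (sum == 0xFFFF) ? 0 : -1;
}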


ip_modify($$dxfer, IPHDR_WR_BYTEOFFSET, $pkt_buf, IPHDR_RD_BYTEOFFSET)

{

xbuf_xfer_set($pkt_buf, IPHDR_RD_START_BYTE) // define as $pkt_buf [0:7]

xbuf_xfer_set($$dxfer, IPHDR_WR_START_BYTE) // define as $$dxfer [0:7]

// alignment check

RD_align = read_align_check (IPHDR_RD_BYTEOFFSET & 0x3)

WR_align = write_align_check (IPHDR_WR_BYTEOFFSET & 0x3)

#if (RD_align != WR_align)

display assembler error

#else

#if (RD_align == 0)

$$dxfer[0] = $pkt_buf[0]

$$dxfer[1] = $pkt_buf[1]

temp = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET0, size(1byte)) // ttl = ttl -1

$$dxfer[2] = ip_cksum_modify(temp, BYTEOFFSET2, size(2byte))

$$dxfer[3] = $pkt_buf[3]

$$dxfer[4] = $pkt_buf[4]

#elif (RD_align == 1)

$$dxfer[0] = $pkt_buf[0]

$$dxfer[1] = $pkt_buf[1]

temp = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET1, size(1byte)) // ttl = ttl - 1

$$dxfer[2:3] = ip_cksum_B3align_modify(temp, $pkt_buf[3], size(2byte)) // because of ttl decr

$$dxfer[4] = $pkt_buf[4]

#elif (RD_align == 2)

$$dxfer[0] = $pkt_buf[0]

$$dxfer[1] = $pkt_buf[1]

$$dxfer[2] = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET2, size(1byte)) // ttl = ttl -1

$$dxfer[3] = ip_cksum_modify($pkt_buf[3], BYTEOFFSET0, size(2byte)) // because of ttl decr

$$dxfer[4] = $pkt_buf[4]

#elif (RD_align == 3)

$$dxfer[0] = $pkt_buf[0]

$$dxfer[1] = $pkt_buf[1]

$$dxfer[2] = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET3, size(1byte)) // ttl = ttl - 1

$$dxfer[3] = ip_cksum_modify($pkt_buf[3], BYTEOFFSET2, size(2byte)) // because of ttl decr


$$dxfer[4] = $pkt_buf[4]

#endif

#endif

return

}

Figure A-15. IP Modify

pk_late_discard(rec_req, exception)

{

// description: Increment exception counter, total discards, set discard flag

tempa = EXCEPTION_COUNTERS

tempa = tempa + ((rec_req << 4), bit_enable(LS8bit))

increment 1 Scratchpad(addr(tempa + exception))

tempa = TOTAL_DISCARDS

increment 1 Scratchpad(addr(tempa))

rec_state = rec_state | (1 << REC_STATE_DISCARD_BIT) // set discard flag

return(rec_state)

}

Figure A-16. Packet Discard

ip_trie5_lookup(ip_dest, SRAM_ROUTE_LOOKUP_BASE)

{

tables_base = SRAM_ROUTE_LOOKUP_BASE

temp_base2 = tables_base + (1 << 16) //add 0x10000, 256 entry table

temp_base3 = temp_base2 + (1 << 8) // add 0x100, multiple 16 entry tables

offset = ip_dest >> 16 // form offset from 31:16

first_lookup:

copy $rd_xfer0 <- SRAM(addr(tables_base + offset), size(1 lword)) // direct lookup off addr 31:16

offset = ip_dest >> 24 // form offset from 31:24

copy $rd_xfer1 <- SRAM(addr(temp_base2 + offset), size(1 lword)) // direct lookup off addr 31:24

prev_rt_long = 0

lookup_short = 0 + ($rd_xfer1 , byte_enable(0011))

if (lookup_short == 0)


goto long_path_only

else

lookup_long = 0 + ($rd_xfer0 , byte_enable(0011))

if (lookup_long == 0)

goto short_path_only

else

goto both_paths

end if

short_path_only:

second_lookup_short:

next_trie(ip_dest, 20, prev_rt_short, lookup_short, $rd_xfer1 , temp_base3)

if (lookup_short == 0)

goto set_route_ptr

end if

third_lookup_short:
next_trie(ip_dest, 16, prev_rt_short, lookup_short, $rd_xfer1, temp_base3)
goto set_route_ptr // last level of the short path: proceed in either case

long_path_only:

lookup_long = 0 + ($rd_xfer0, byte_enable(0011))

if (lookup_long == 0)

goto set_route_ptr

end if

second_lookup_long:

next_trie(ip_dest, 12, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)

if (lookup_long == 0)

goto set_route_ptr

end if

third_lookup_long:


next_trie(ip_dest, 8, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)

if (lookup_long == 0)

goto set_route_ptr

end if

fourth_lookup_long:

next_trie(ip_dest, 4, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)

if (lookup_long == 0)

goto set_route_ptr

end if

fifth_lookup_long:

next_trie(ip_dest, 0, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)

if (lookup_long == 0)

goto set_route_ptr

end if

both_paths:

lookup_short = ((lookup_short + (ip_dest >> 20)), bit_enable(LS4bit))

copy $rd_xfer1 <- SRAM(addr(temp_base3 + lookup_short), size(1 lword))

prev_rt_short = 0 + ($rd_xfer1 , byte_enable(1100))

lookup_long = ((lookup_long + (ip_dest >> 12)), bit_enable(LS4bit))

copy $rd_xfer0 <- SRAM(addr(temp_base3 + lookup_long), size(1 lword))

prev_rt_long = 0 + ($rd_xfer0 , byte_enable(1100))

lookup_long = 0 + ($rd_xfer0 , byte_enable(0011))

if (lookup_long == 0)

goto second_both_no_long

end if

second_both_long:

lookup_short = 0 + ($rd_xfer1 , byte_enable(0011))

if (lookup_short == 0)

goto third_lookup_long

else

goto third_lookup_both

end if

second_both_no_long:


lookup_short = 0 + ($rd_xfer1 , byte_enable(0011))

if (lookup_short == 0)

goto set_route_ptr

else

goto third_lookup_short

end if

third_lookup_both:

lookup_short = ((lookup_short + (ip_dest >> 16)), bit_enable(LS4bit))

copy $rd_xfer1 <- SRAM(addr(temp_base3 + lookup_short), size(1 lword))

prev_rt_short = 0 + ($rd_xfer1 , byte_enable(1100))

lookup_long = ((lookup_long + (ip_dest >> 8)), bit_enable(LS4bit))

copy $rd_xfer0 <- SRAM(addr(temp_base3 + lookup_long), size(1 lword))

prev_rt_long = 0 + ($rd_xfer0 , byte_enable(1100))

lookup_long = 0 + ($rd_xfer0 , byte_enable(0011))

if (lookup_long == 0)

goto set_route_ptr

else

goto fourth_lookup_long

end if

set_route_ptr:

rt_ptr = $rd_xfer0 >> 17 // long match

if (rt_ptr!= 0)

goto end

end if

rt_ptr = prev_rt_long >> 17 // long match at previous trie

if (rt_ptr!= 0)

goto end

end if

rt_ptr = $rd_xfer1 >> 17 // short match

if (rt_ptr!= 0)

goto end

end if

rt_ptr = prev_rt_short >> 17 // short match at previous trie


end:

return(rt_ptr)

}

Figure A-17. Trie Lookup

next_trie(ipaddr, SHIFT_AMT, prevout_rt_ptr, lookup, $xfer, trie_base)

{

lookup = 0 + ((lookup + (ipaddr >> SHIFT_AMT)), bit_enable(LS4bit))

copy $xfer <- SRAM(addr(trie_base + lookup), size(1 lword))

prevout_rt_ptr = 0 + ($xfer, byte_enable(1100))

lookup = 0 + ($xfer, byte_enable(0011))

return(lookup)

}

Figure A-18. Next_Trie_Search for Trie Lookup
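Each next_trie step consumes four address bits; the whole longest-prefix walk can be pictured in C as below (a sketch under an assumed entry layout matching the shifts above: bits 31:17 hold the route pointer, bits 15:0 the index of the next 16-entry table, and tables are assumed 16-aligned, a simplification of the bit_enable merge):

#include <stdint.h>

extern uint32_t trie_table[];      /* the multiple 16-entry tables in SRAM */

/* Walk the trie 4 bits per level, remembering the last nonzero route
 * pointer (the longest match), as the pseudo code's prev_rt_* registers do. */
static uint32_t trie_lookup(uint32_t next, uint32_t ip_dest, int shift)
{
    uint32_t best_rt = 0;
    while (next != 0 && shift >= 0) {
        uint32_t nibble = (ip_dest >> shift) & 0xF;   /* next 4 addr bits  */
        uint32_t e = trie_table[next + nibble];       /* SRAM read         */
        if ((e >> 17) != 0)
            best_rt = e >> 17;                        /* longer match      */
        next   = e & 0xFFFF;                          /* descend, 0 = leaf */
        shift -= 4;
    }
    return best_rt;                                   /* 0 = no route      */
}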

write_modified_IP_Ether_header()

{

if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)

copy $$dxfer[0] -> DRAM(addr(pkbuf_addr + QWOFFSET4), size(1 quadword))

end if

// $$dxfer0 = output port, $$dxfer1 = MAC DA bytes 0-3, $$dxfer2 = MAC DA bytes 4-5

output_intf = $$dxfer[0] << 3 // save for enqueue

$$dxfer[0] = $$dxfer[1] // new DA bytes 0-3

#ifdef LITTLE_ENDIAN

$$dxfer[1] = $$dxfer[2] + (sa01 << 16)// merge new DA 4-5 with SA 0-1

#else

$$dxfer[1] = sa01 + ($$dxfer[2] << 16)// merge new DA 4-5 with SA 0-1

#endif

if ((bit(pkstate, SHIFT_PKTLINKTYPE_ETHERNET) == TRUE) || (bit(pkstate, SHIFT_PKTLINKTYPE_LLC) ==

TRUE))

$$dxfer[2] = $pkt_buf[2] // previous SA bytes 2-5
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)
$$dxfer[2] = tempa // previous SA bytes 2-5
$$dxfer[3] = tempb // len/ssap/dsap
$$dxfer[4] = tempc // CTL/OUI

end if

copy $$dxfer[0:7] -> DRAM(addr(pkbuf_addr + QWOFFSET0), size(4 quadwords)) // write modified packet

return

}

Figure A-19. Write Modified IP and Ether Header

tx_assignment_read(@assign#)

{

wait_for_assignment:

if (@assign# < 0)

goto wait_for_assignment

else

port = (@assign#, bit_enable(LS4bit))

skip_flag = @assign# & skip_bit_on // skip_bit_on is set in initialization

tfifo_entry = (port , bit_enable(LS4bit))

q_offset = port << 4

end if

return

}

Figure A-20. Transmit Assignment Read


//****************** Format of Queue Descriptor *********************
// q_desc0: head_ptr [31:16], tail_ptr [15:0]
// q_desc1: Packet Count [31:0]
//******************** Format of Packet Link List ********************
// pkt_link0: Next Packet Link [31:0]
// pkt_link1: RCV_port [31:27], Freelist [26:24], pkt_start_byte [23:16], pkt_end_byte [15:8], ele_count [7:0]
//*******************************************************************

tx_pktlinklist_read(q_desc_base, q_offset, buf_desc_base)

{

copy $q_desc0 <- SRAM(addr(q_desc_base + q_offset), size(2 lwords)) with lock

// lock and read the queue descriptor 2 longwords // unlocked by tx_pktlink_update

tmp_head_ptr = q_desc0 >> 16 // isolate head ptr

buf_offset = ~0x7 & q_desc0 >> 13 // isolate next packet link and

//mult by 8 to get relative address

copy $pkt_link0 <- SRAM(addr(buf_desc_base + tmp_head_ptr), size(2 lwords))

//read packet_link 2 longwords get next head, status

tail_ptr = 0 + (q_desc0, byte_enable(0011))

ele_remaining = 0 + ($pkt_link1 >> DESC1_ELE_COUNT0, byte_enable(0001))

last_mpkt_byte_cnt = 0x3f & ($pkt_link1 >> DESC1_PKT_END_BYTE8)

bank = bit20on & ($pkt_link1 >> 4)

}

Figure A-21. Transmit Packet Link List Read

tx_packetlinklist_update($q_desc0, $q_desc1, q_desc_base, tail_ptr, q_offset, $pkt_link0, port)

{

q_pkt_count = $q_desc1 - 1 // decrement the element count
if (q_pkt_count > 0)
goto packets_remaining
else


tx_portvector_clear(port, pwp_addr)

end if

packets_remaining:

tail_ptr = (tail_ptr,byte_enable(0011)) + ($pkt_link0 << 16, byte_enable(1100))

$q_desc0 = tail_ptr

$q_desc1 = 0 + (q_pkt_count, byte_enable(0011))

copy $q_desc0 -> SRAM(addr(q_desc_base + q_offset), size(2 lwords)) and unlock // locked by tx_linklist

}

Figure A-22. Transmit Packet Link List Update

tx_portvector_clear(port, pwp_addr)

{

tpc_temp = (1 << 5) – port // indirect shift left 32 - portnum

$xfer_reg = 1 << tpc_temp

clear bit Scratch(addr(pwp_addr), bit position($xfer_reg)) //clear bit for this port

}

Figure A-23. Transmit Port Vector clear

tx_last_mpkt_xfr(bank , buf_offset, last_mpkt_byte_cnt,tfifo_entry, pkt_buffer_base)

{

qw_offset = bank + (buf_offset << 3)

indirect = 0x7 & (last_mpkt_byte_cnt >> 3) //divide by 8 for conversion to quadwords

indirect = bit20_15on | indirect << 16 //place quadword count in 19:16

copy SDRAM(addr(pkt_buffer_base + qw_offset), size(8 quadwords)) -> tfifo(indirect | tfifo_entry << 7)

}

Figure A-24. Last Packet Transfer


//****************** Format of TFIFO Control Field *********************
// 31:19 18      17       16       15:13   12:10        9    8    7     6:4  3:0
// RES   Tx Err  Tx asis  Pre pnd  # qwds  Valid bytes  EOP  SOP  Skip  mac  Port
//***********************************************************************

tx_status_set(last_mpkt_byte_cnt, BITS_TO_SET, port)

{

temp = BITS_TO_SET | last_mpkt_byte_cnt << 2 // ex) 16 elements count OR EOP_AND_SOP = 3

$tfifo_ctl_wd0 = port | temp << 8

}

Figure A-25. Set Transmit Control Word

tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)

{

tfifo_status_write(tfifo_entry, $tfifo_ctl_wd0)

xmit_ptr_wait:

copy $xmit_ptr <- CSR_XMIT_PTR

copy $tx_rdy_copy <- CSR_XMIT_RDY_LO

temp_reg = $xmit_ptr - tfifo_entry

if (temp_reg == 0)

goto port_wait_loop

end if

if (temp_reg > 0)

goto ptr_wrapped // xmit ptr > t_fifo_entry -> wrap condition

end if

temp_reg = temp_reg + 5

if (temp_reg >=0)

goto port_wait_loop

else

goto xmit_ptr_wait // the xmit_ptr is not close enough yet

end if

ptr_wrapped:


temp_reg = temp_reg - 11

if (temp_reg < 0)

goto xmit_ptr_wait // the xmit_ptr is not close enough yet

end if

port_wait_loop:

if ((1 & $tx_rdy_copy >> tfifo_entry) > 0)

return_status = PASS

goto write_validate

else

$tfifo_ctl_wd0 = tfifo_entry | 1 << 7 // set skip bit

tfifo_status_write(tfifo_entry, $tfifo_ctl_wd0)

return_status = FAIL

write_validate:

tfifo_validate_write(tfifo_entry, in_bit15on)

return

}

Figure A-26. TFIFO Validate

tx_portvect_modify(@local_pwp , port, IN_VALUE)

{

hold_it = (1 << 5) - port

hold_it = 1 << hold_it

#if (IN_VALUE == 1) // if set bit

@local_pwp = @local_pwp | hold_it

#else // IN_VALUE == 0 (clear bit)

@local_pwp = @local_pwp & ~(hold_it)

#endif

}

Figure A-27. Transmit Port Vector Modify


tx_mpkt_xfr(bank, buf_offset, tfifo_entry, pkt_buffer_base, 8)

{

qw_offset = bank + (buf_offset << 3)

indirect = bit20_15on | 7 << 16 // place quadword count 7 in 19:16
indirect = indirect | tfifo_entry << 7 // put element no. in 10:7
copy SDRAM(addr(pkt_buffer_base + qw_offset), size(8 quadwords)) -> t_fifo(indirect)

}

Figure A-28. Transmit Packet Transfer


Appendix B: Microengine Instruction Set

Table B-1. Microengine Instruction Set

Instruction Description

Arithmetic,Rotate, and Shift

Instructions

alu

Perform an ALU operation on one or two operands and deposit the result into

the destination register. Update all ALU condition codes according to the

result of the operation. Condition codes are lost during context swaps. The

sign condition code is not valid on underflow or overflow conditions.

alu_shf

Perform an ALU operation on one or two operands and deposit the result into

the destination register. The B operand is shifted or rotated prior to the ALU

operation. Update all ALU condition codes according to the result of the

operation. Condition codes are lost during context swaps. The sign condition

code is not valid on underflow or overflow conditions.

dbl_shf

Load a destination register with a 32-bit longword that is formed by

concatenating the A operands and B operands together, right shifting the

64-bit quantity by the specified amount, and then storing the lower 32 bits

Branch and Jump Instructions

br Branch unconditionally

br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout

Branch to an instruction at a specified label based on an ALU condition code.

The ALU condition codes are Sign, Zero, and Carryout (cout). The sign

condition code is not valid on underflow or overflow conditions.

br_bset, br_bclr Branch to the instruction at the specified label when the specified bit of the

register is clear or set. These instructions set the condition codes.

br=byte, br!=byte

Branch to the instruction at the specified label if a specified byte in a longword

matches or mismatches the byte_compare_value. The br=byte instruction

prefetches the instruction for the “branch taken” condition rather than the

next sequential instruction. The br!=byte instruction prefetches the next

sequential instruction. These instructions set the condition codes.

br=ctx, br!=ctx Branch to the instruction at the specified label based on whether or not the

current context is the specified context number.


Instruction Description

br_inp_state

Branch if the state of the specified state name is set to 1. A state is set to 1 or 0

by a functional unit in the IXP1200 and indicates the currently processing

state. It is available to all microengines.

br_!signal Branch if the specified signal is deasserted. If the signal is asserted, clear the

signal and do not branch.

jump Unconditional branch to an address that is formed during runtime execution

by the addition of the register and label# values.

rtn

Unconditional branch to the address contained in the lower 10 bits of the
specified register (address 0 through 1023). Typically used to return from a

branch or jump instruction.

Reference Instructions

csr

Issue a read or write operation to the specified control/status register(CSR).

Transfers exactly one 32-bit register value to or from the specified SRAM

transfer register.

fast_wr

Write the specified immediate data to the specified FBI CSR. A fast write

operation has the write data specified directly in the instruction rather than in

a transfer register. This improves performance by eliminating the need for the

FBI Unit to pull the data from a transfer register. The FBI Unit automatically

shifts the immediate data into the appropriate register field corresponding to

the thread that is writing the FAST_WR data.

local_csr_rd

Read the specified 16-bit microengine CSR register. The 16-bit read data is

accessed by replacing the immediate data source operand of the next

instruction with the microengine CSR read data. If the very next instruction

does not contain an immediate data source operand field, then the opportunity

to access the CSR data read from the previous instruction is lost. A
local_csr_rd or local_csr_wr instruction must not immediately follow or
precede a local_csr_wr instruction.

local_csr_wr

Write specified microengine CSR register with the lower 16 bits of the

specified source register. Unlike normal GPR registers, no built in bypasses

exist in the datapath when reading microengine CSRs immediately after

writing them. Therefore, to compensate for microengine CSR read/write

latency, a local_csr_rd to a given CSR must be at least the third opcode


Instruction Description

following a local_csr_wr to the same CSR in order for CSR read data to reflect

CSR write data. A local_csr_wr instruction must not be placed in the last

deferred window of an instruction. Also, a local_csr_rd or local_csr_wr

instruction must not immediately follow or precede a local_csr_wr instruction.

r_fifo_rd Issue a read reference from the receive FIFO data and status elements to a

transfer register

pci_dma Used to issue DMA requests to the PCI Unit. Improved performance can be

achieved if DMA data is located on 64 byte boundaries.

scratch Issue a memory reference to scratchpad memory

sdram Issue a memory reference to SDRAM

sram Issue a memory reference to SRAM, Flash, or Slow Port

t_fifo_wr Issue a write reference from a transfer register data and control/prepend

elements directly to the transmit FIFO

Local Register Instructions

find_bset, find_bset_with_mask

Returns the bit position number of the first set bit in a 16-bit field of a

microengine register. Provides an optional shift control token that enables any

arbitrary 16-bit field to be evaluated. The result of the operation is deposited

into one of two result registers that are not visible to the microengines. The

microengines must explicitly move the contents on the result registers into one

of the microengine GPR or transfer registers via the load_bset_result1 and

load_bset_result2.

immed

Load immediate 16-bits into the specified register. The immediate data must

be specified having the upper 16-bits equal to either all zeros or ones. The

immediate data can be stored in the longword aligned on an 8-bit boundary

based on the optional shift parameter. The fill data is either all zeros or ones

and is based on the specified upper 16-bits.

immed_b0, immed_b1, immed_b2, immed_b3

If a GPR is specified as the dest_reg, one byte of immediate data is loaded into

the specified byte of the destination while preserving all the other bits of the

destination. These instructions perform a read-modify-write operation on a

specified destination register. If a Transfer register is specified as the dest_reg,

these instructions perform a read and modify from a read transfer register and

write the result into a write transfer register.


Instruction Description

immed_w0, immed_w1

If a GPR is specified as the dest_reg, one word of immediate data is loaded into

the specified word of the destination while preserving all the other bits of the

destination. These instructions perform a read-modify-write operation on a

specified destination register.

ld_field, ld_field_w_clr

Load 1 or more bytes within a register with the shifted value of another

operand. Data in the bytes that are not loaded remain unchanged or are

cleared. Ld_field performs a read-modify-write on a destination register.

Ld_field_w_clr performs a write to a destination register. When a transfer

register is used as the destination register, ld_field reads from the read

transfer register and writes the modified data to the write transfer register.

load_addr Load a register with an address of the location specified by label#

load_bset_result1,

load_bset_result2

Load the specified register with the result of a find_bset or

find_bset_with_mask instruction. These instructions set the condition codes. If

the result is 0, then the result register data is invalid and the find_bset

instruction did not detect a set bit. Due to latency issues in the hardware, a

minimum of three microengine cycles (equivalent to three instructions) must

occur between the final find_bset instruction and the load_bset_result in order

for the result registers to reflect the result of the final find_bset instruction.

After a find_bset or find_bset_with_mask instruction is deposited into a result

register, the result register is validated and locked until it is explicitly cleared

by the user. If the first result register is locked, the second result register will

be loaded and locked when the next set bit is detected. If both result registers

are locked then the result is not reported. The result registers are explicitly

unlocked (or cleared) using the clr_results optional token.

Miscellaneous Instructions

ctx_arb Swap the currently running context out to let another context execute. Wake

up the swapped out context when the specified signal is activated.

nop Consume one microcycle without performing any operation and without

setting any microengine state

hash1_48, hash2_48, hash3_48 Executes one, two, or three 48-bit hash operations

hash1_64, hash2_64, hash3_64 Executes one, two, or three 64-bit hash operations


Appendix C: Instruction Mix Data

Table C-1. Instruction Mix Data for 64-byte packets

Instruction (description)                      uEngine0  uEngine1  uEngine2  uEngine3  uEngine4  uEngine5  Rx(0,1,2,3)  Tx(4,5)  Overall

Arithmetic, Rotate, and Shift Instructions
alu (alu operation)                               40838     40703     40817     41015     37382     37645     163373     75027    238400
alu_shf (alu and shift operation)                 55298     56617     56418     55305     85494     84574     223638    170068    393706
dbl_shf (concatenate two longwords, shift
  the result, and save a longword)                    0         0         0         0         0         0          0         0         0
Sub Total                                         96136     97320     97235     96320    122876    122219     387011    245095    632106
Percentage                                                                                                      40.8%     48.2%     43.4%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0,
  br=cout, br!=cout (branch on condition code)    61389     60253     59679     60116     63440     69028     241437    132468    373905
br_bset, br_bclr (branch on bit set or clear)      5536      5835      5833      5827         0         0      23031         0     23031
br=byte, br!=byte (branch on byte equal)              0         0         0         0         0         0          0         0         0
br=ctx, br!=ctx (branch on current context)          17        13        13        13     10196     10192         56     20388     20444
br_inp_state (branch on event state,
  e.g. sram done)                                     0         0         0         0         0         0          0         0         0
br_!signal (branch if signal deasserted)            339       339       340       340         0         0       1358         0      1358
jump (jump to label)                                  0         0         0         0      1643      1572          0      3215      3215
rtn (return from a branch or a jump)                  0         0         0         0         0         0          0         0         0
Sub Total                                         67281     66440     65865     66296     75279     80792     265882    156071    421953
Percentage                                                                                                      28.0%     30.7%     29.0%

Reference Instructions
csr (csr reference)                                4768      4192      4177      4239      4678      4699      17376      9377     26753
fast_wr (write immediate data to
  thd_done csrs)                                      0         0         0         0      6108      6109          0     12217     12217
local_csr_rd, local_csr_wr (read and
  write csrs)                                         0         0         0         0         0         0          0         0         0
r_fifo_rd (read the receive fifo)                  2142      2338      2338      2336         0         0       9154         0      9154
pci_dma (issue a request to the pci unit)             0         0         0         0         0         0          0         0         0
scratch (scratchpad reference)                      871       772       742       500      2405      2319       2885      4724      7609
sdram (sdram reference)                            2430      2724      2723      2721      3059      3057      10598      6116     16714
sram (sram reference)                              7295      8173      8201      8442      7750      7361      32111     15111     24323
t_fifo_wr (write to the transmit fifo)                0         0         0         0      3087      3094          0      6181      6181
Sub Total                                         17506     18199     18181     18238     27087     26639      72124     53726    125850
Percentage                                                                                                       7.6%     10.6%      8.6%

Local Register Instructions
find_bset, find_bset_with_mask (find first
  set bit in a 16-bit field of a register)            0         0         0         0         0         0          0         0         0
immed (load immediate word, sign extend or
  zero fill with shift)                           23458     22802     22734     22613        55        65      91607       120     91727
immed_b0, immed_b1, immed_b2, immed_b3
  (load immediate byte to a field)                    0         0         0         0         0         0          0         0         0
immed_w0, immed_w1 (load immediate word
  to a field)                                         3         4         4         4         0         0         15         0        15
ld_field, ld_field_w_clr (load byte(s)
  into specified field(s))                        15185     16744     16809     17285      6200      5808      66023     12008     78031
load_addr (load instruction address)                  0         0         0         0         0         0          0         0         0
load_bset_result1, load_bset_result2 (load
  the result of a find_bset or
  find_bset_with_mask instruction)                    0         0         0         0         0         0          0         0         0
Sub Total                                         38646     39550     39547     39902      6255      5873     157645     12128    169773
Percentage                                                                                                      16.6%      2.4%     11.7%

Miscellaneous Instructions
ctx_arb (context swap and wake on event)          16403     15382     15437     15162     16306     16407      62384     32713     95097
nop (no operation)                                    0         0         0         0      4743      4476          0      9219      9219
hash1_48, hash2_48, hash3_48 (48-bit hash)          779       779       779       779         0         0       3116         0      3116
hash1_64, hash2_64, hash3_64 (64-bit hash)            0         0         0         0         0         0          0         0         0
Sub Total                                         17182     16161     16216     15941     21049     20883      65500     41932    107432
Percentage                                                                                                       6.9%      8.2%      7.4%

TOTAL                                            236751    237670    237044    236697    252546    256406     948162    508952   1457114
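Each Percentage row in Tables C-1 through C-4 is simply the category subtotal divided by the corresponding column total (Rx, Tx, or overall). As a sanity check, the short C program below (a minimal sketch using the Table C-1 values) reproduces the arithmetic-category percentages:

    #include <stdio.h>

    int main(void)
    {
        /* Arithmetic, rotate, and shift subtotals and grand totals, Table C-1 */
        double arith_rx = 387011, arith_tx = 245095, arith_all = 632106;
        double total_rx = 948162, total_tx = 508952, total_all = 1457114;

        printf("Rx:      %.1f%%\n", 100.0 * arith_rx  / total_rx);   /* 40.8% */
        printf("Tx:      %.1f%%\n", 100.0 * arith_tx  / total_tx);   /* 48.2% */
        printf("Overall: %.1f%%\n", 100.0 * arith_all / total_all);  /* 43.4% */
        return 0;
    }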


Table C-2. Instruction Mix Data for 594-byte packets
(instruction descriptions as in Table C-1)

Instruction                            uEngine0  uEngine1  uEngine2  uEngine3  uEngine4  uEngine5  Rx(0,1,2,3)   Tx(4,5)   Overall

Arithmetic, Rotate, and Shift Instructions
alu                                      252209    251109    251407    251782    270134    270583     1006507    540717   1547224
alu_shf                                  235238    238200    236792    235271    706257    693919      945501   1400176   2345677
dbl_shf                                       0         0         0         0         0         0           0         0         0
Sub Total                                487447    489309    488199    487053    976391    964502     1952008   1940893   3892901
Percentage                                                                                               32.5%     51.3%     39.8%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout
                                         550680    547273    547142    546994    468202    472093     2192089    940295   3132384
br_bset, br_bclr                          19576     19870     19870     19870         0         0       79186         0     79186
br=byte, br!=byte                             0         0         0         0         0         0           0         0         0
br=ctx, br!=ctx                              17        13        13        13     75932     75603          56    151535    151591
br_inp_state                                  0         0         0         0         0         0           0         0         0
br_!signal                                  339       339       340       340         0         0        1358         0      1358
jump                                          0         0         0         0     35251     33333           0     68584     68584
rtn                                           0         0         0         0         0         0           0         0         0
Sub Total                                570612    567495    567365    567217    579385    581029     2272689   1160414   3433103
Percentage                                                                                               37.8%     30.7%     35.1%

Reference Instructions
csr                                       63518     63387     62803     62725     42576     42764      252433     85340    337773
fast_wr                                       0         0         0         0     45549     45159           0     90708     90708
local_csr_rd, local_csr_wr                    0         0         0         0         0         0           0         0         0
r_fifo_rd                                  2144      2340      2340      2340         0         0        9164         0      9164
pci_dma                                       0         0         0         0         0         0           0         0         0
scratch                                     875       776       728       455      8962      8860        2834     17822     20656
sdram                                      9448      9742      9742      9742     22779     22583       38674     45362     84036
sram                                       7297      8174      8222      8495      7758      7364       32188     15122    104692
t_fifo_wr                                     0         0         0         0     26990     26856           0     53846     53846
Sub Total                                 83282     84419     83835     83757    154614    153586      335293    308200    643493
Percentage                                                                                                5.6%      8.2%      6.6%

Local Register Instructions
find_bset, find_bset_with_mask                0         0         0         0         0         0           0         0         0
immed                                    126959    127183    125967    125538     17457     16647      505647     34104    539751
immed_b0, immed_b1, immed_b2, immed_b3        0         0         0         0         0         0           0         0         0
immed_w0, immed_w1                            3         4         4         4         0         0          15         0        15
ld_field, ld_field_w_clr                  22208     23774     23870     24416      6208      5812       94268     12020    106288
load_addr                                     0         0         0         0         0         0           0         0         0
load_bset_result1, load_bset_result2          0         0         0         0         0         0           0         0         0
Sub Total                                149170    150961    149841    149958     23665     22459      599930     46124    646054
Percentage                                                                                               10.0%      1.2%      6.6%

Miscellaneous Instructions
ctx_arb                                  213720    211737    211940    211769    104106    104924      849166    209030   1058196
nop                                           0         0         0         0     59199     56418           0    115617    115617
hash1_48, hash2_48, hash3_48                780       780       780       780         0         0        3120         0      3120
hash1_64, hash2_64, hash3_64                  0         0         0         0         0         0           0         0         0
Sub Total                                214500    212517    212720    212549    163305    161342      852286    324647   1176933
Percentage                                                                                               14.2%      8.6%     12.0%

TOTAL                                   1505011   1504701   1501960   1500534   1897360   1882918     6012206   3780278   9792484


Table C-3. Instruction Mix Data for 1518-byte packets
(instruction descriptions as in Table C-1)

Instruction                            uEngine0  uEngine1  uEngine2  uEngine3  uEngine4  uEngine5  Rx(0,1,2,3)   Tx(4,5)   Overall

Arithmetic, Rotate, and Shift Instructions
alu                                      650125    648994    649386    650209    720363    714363     2598714   1434726   4033440
alu_shf                                  442283    454063    453296    454855   1661437   1676711     1804497   3338148   5142645
dbl_shf                                       0         0         0         0         0         0           0         0         0
Sub Total                               1092408   1103057   1102682   1105064   2381800   2391074     4403211   4772874   9176085
Percentage                                                                                               30.3%     50.9%     38.4%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout
                                        1437275   1449626   1450082   1450071   1189315   1175727     5787054   2365042   8152096
br_bset, br_bclr                          46780     33261     33303     33710         0         0      147054         0    147054
br=byte, br!=byte                             0         0         0         0         0         0           0         0         0
br=ctx, br!=ctx                              17        13        13        13    208426    207095          56    415521    415577
br_inp_state                                  0         0         0         0         0         0           0         0         0
br_!signal                                  339       339       340       340         0         0        1358         0      1358
jump                                          0         0         0         0     65712     68952           0    134664    134664
rtn                                           0         0         0         0         0         0           0         0         0
Sub Total                               1484411   1483239   1483738   1484134   1463453   1451774     5935522   2915227   8850749
Percentage                                                                                               40.8%     31.1%     37.0%

Reference Instructions
csr                                      121032    118756    118141    120411    109475    104804      478340    214279    692619
fast_wr                                       0         0         0         0    125829    124251           0    250080    250080
local_csr_rd, local_csr_wr                    0         0         0         0         0         0           0         0         0
r_fifo_rd                                  2166      2354      2356      2360         0         0        9236         0      9236
pci_dma                                       0         0         0         0         0         0           0         0         0
scratch                                     884       784       784       402     22088     21985        2854     44073     46927
sdram                                     16348     16435     16455     16659     62921     62128       65897    125049    190946
sram                                      49762     52949     53221     50683      7341      7789      206615     15130    237873
t_fifo_wr                                     0         0         0         0     73268     73486           0    146754    146754
Sub Total                                190192    191278    190957    190515    400922    394443      762942    795365   1558307
Percentage                                                                                                5.3%      8.5%      6.5%

Local Register Instructions
find_bset, find_bset_with_mask                0         0         0         0         0         0           0         0         0
immed                                    228378    224671    223414    227183     35737     37709      903646     73446    977092
immed_b0, immed_b1, immed_b2, immed_b3        0         0         0         0         0         0           0         0         0
immed_w0, immed_w1                            3         4         4         4         0         0          15         0        15
ld_field, ld_field_w_clr                  44599     30559     30590     31575      5876      6152      137323     12028    149351
load_addr                                     0         0         0         0         0         0           0         0         0
load_bset_result1, load_bset_result2          0         0         0         0         0         0           0         0         0
Sub Total                                272980    255234    254008    258762     41613     43861     1040984     85474   1126458
Percentage                                                                                                7.2%      0.9%      4.7%

Miscellaneous Instructions
ctx_arb                                  597336    596105    596452    596067    288190    285074     2385960    573264   2959224
nop                                           0         0         0         0    114081    120426           0    234507    234507
hash1_48, hash2_48, hash3_48                788       785       785       787         0         0        3145         0      3145
hash1_64, hash2_64, hash3_64                  0         0         0         0         0         0           0         0         0
Sub Total                                598124    596890    597237    596854    402271    405500     2389105    807771   3196876
Percentage                                                                                               16.4%      8.6%     13.4%

TOTAL                                   3638115   3629698   3628622   3635329   4690059   4686652    14531764   9376711  23908475


Table C-4. Instruction Mix Data for mixed-size packets
(instruction descriptions as in Table C-1)

Instruction                            uEngine0  uEngine1  uEngine2  uEngine3  uEngine4  uEngine5  Rx(0,1,2,3)   Tx(4,5)   Overall

Arithmetic, Rotate, and Shift Instructions
alu                                      211184    213119    211770    212799    222931    224453      848872    447384   1296256
alu_shf                                  163693    162612    162362    161280    545913    534684      649947   1080597   1730544
dbl_shf                                       0         0         0         0         0         0           0         0         0
Sub Total                                374877    375731    374132    374079    768844    759137     1498819   1527981   3026800
Percentage                                                                                               31.9%     50.7%     39.2%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout
                                         447651    453420    454827    454564    380178    384693     1810462    764871   2575333
br_bset, br_bclr                          18291     14090     13762     14011         0         0       60154         0     60154
br=byte, br!=byte                             0         0         0         0         0         0           0         0         0
br=ctx, br!=ctx                              17        13        13        13     61692     61699          56    123391    123447
br_inp_state                                  0         0         0         0         0         0           0         0         0
br_!signal                                  339       339       340       340         0         0        1358         0      1358
jump                                          0         0         0         0     23872     22618           0     46490     46490
rtn                                           0         0         0         0         0         0           0         0         0
Sub Total                                466298    467862    468942    468928    465742    469010     1872030    934752   2806782
Percentage                                                                                               39.8%     31.0%     36.4%

Reference Instructions
csr                                       31180     27100     29915     28126     34663     35731      116321     70394    186715
fast_wr                                       0         0         0         0     37005     37013           0     74018     74018
local_csr_rd, local_csr_wr                    0         0         0         0         0         0           0         0         0
r_fifo_rd                                  2110      2320      2072      2260         0         0        8762         0      8762
pci_dma                                       0         0         0         0         0         0           0         0         0
scratch                                     378       503       390       380      6866      6986        1651     13852     15503
sdram                                      6963      6837      6688      6396     18506     18509       26884     37015     63899
sram                                      28808     32195     29592     30886      7902      7238      121481     15140     79402
t_fifo_wr                                     0         0         0         0     22801     22360           0     45161     45161
Sub Total                                 69439     68955     68657     68048    127743    127837      275099    255580    530679
Percentage                                                                                                5.8%      8.5%      6.9%

Local Register Instructions
find_bset, find_bset_with_mask                0         0         0         0         0         0           0         0         0
immed                                     66852     60814     64440     61808     11419     11011      253914     22430    276344
immed_b0, immed_b1, immed_b2, immed_b3        0         0         0         0         0         0           0         0         0
immed_w0, immed_w1                            3         4         4         4         0         0          15         0        15
ld_field, ld_field_w_clr                  25513     21364     19778     20845      6325      5712       87500     12037     99537
load_addr                                     0         0         0         0         0         0           0         0         0
load_bset_result1, load_bset_result2          0         0         0         0         0         0           0         0         0
Sub Total                                 92368     82182     84222     82657     17744     16723      341429     34467    375896
Percentage                                                                                                7.3%      1.1%      4.9%

Miscellaneous Instructions
ctx_arb                                  177274    177985    179111    178960     89367     90537      713330    179904    893234
nop                                           0         0         0         0     42096     39693           0     81789     81789
hash1_48, hash2_48, hash3_48                751       825       736       801         0         0        3113         0      3113
hash1_64, hash2_64, hash3_64                  0         0         0         0         0         0           0         0         0
Sub Total                                178025    178810    179847    179761    131463    130230      716443    261693    978136
Percentage                                                                                               15.2%      8.7%     12.7%

TOTAL                                   1181007   1173540   1175800   1173473   1511536   1502937     4703820   3014473   7718293


Table C-5. Memory Access per cycle

Reference      uEngine0   uEngine1   uEngine2   uEngine3   uEngine4   uEngine5   Average

64-byte packets
csr            0.013912   0.012231   0.012187   0.0123682  0.013649   0.01371    0.01301
fast_wr        0          0          0          0          0.017821   0.017824   0.017823
r_fifo_rd      0.00625    0.006822   0.006822   0.0068158  0          0          0.006677
scratch        0.002541   0.002252   0.002165   0.0014589  0.007017   0.006766   0.0037
sdram          0.00709    0.007948   0.007945   0.0079391  0.008925   0.008919   0.008128
sram           0.021285   0.023847   0.023928   0.0246314  0.022612   0.021477   0.022963
t_fifo_wr      0          0          0          0          0.009007   0.009027   0.009017
Total          0.051078   0.0531     0.053047   0.0532134  0.079032   0.077725   0.061199

594-byte packets
csr            0.0255     0.025448   0.025213   0.0251818  0.017093   0.017168   0.022601
fast_wr        0          0          0          0          0.018286   0.01813    0.018208
r_fifo_rd      0.000861   0.000939   0.000939   0.0009394  0          0          0.00092
scratch        0.000351   0.000312   0.000292   0.0001827  0.003598   0.003557   0.001382
sdram          0.003793   0.003911   0.003911   0.0039111  0.009145   0.009066   0.005623
sram           0.002929   0.003282   0.003301   0.0034104  0.003115   0.002956   0.003166
t_fifo_wr      0          0          0          0          0.010835   0.010782   0.010809
Total          0.033435   0.033891   0.033657   0.0336254  0.062072   0.061659   0.043056

1518-byte packets
csr            0.019214   0.018853   0.018755   0.0191157  0.01738    0.016638   0.018326
fast_wr        0          0          0          0          0.019976   0.019725   0.019851
r_fifo_rd      0.000344   0.000374   0.000374   0.0003747  0          0          0.000367
scratch        0.00014    0.000124   0.000124   6.382E-05  0.003507   0.00349    0.001242
sdram          0.002595   0.002609   0.002612   0.0026447  0.009989   0.009863   0.005052
sram           0.0079     0.008406   0.008449   0.0080461  0.001165   0.001237   0.005867
t_fifo_wr      0          0          0          0          0.011632   0.011666   0.011649
Total          0.030194   0.030366   0.030315   0.0302449  0.063648   0.062619   0.041231

Mixed-size packets
csr            0.015459   0.013436   0.014832   0.0139448  0.017186   0.017715   0.015429
fast_wr        0          0          0          0          0.018347   0.018351   0.018349
r_fifo_rd      0.001046   0.00115    0.001027   0.0011205  0          0          0.001086
scratch        0.000187   0.000249   0.000193   0.0001884  0.003404   0.003464   0.001281
sdram          0.003452   0.00339    0.003316   0.0031711  0.009175   0.009177   0.00528
sram           0.014283   0.015962   0.014672   0.0153132  0.003918   0.003589   0.011289
t_fifo_wr      0          0          0          0          0.011305   0.011086   0.011195
Total          0.034428   0.034188   0.03404    0.033738   0.063335   0.063381   0.043851
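The Average column in Table C-5 is the arithmetic mean of the per-microengine rates, taken over the microengines that actually issue the reference (for example, fast_wr and t_fifo_wr occur only on the transmit microengines 4 and 5). A minimal C sketch of that calculation, using the 64-byte csr and fast_wr rows:

    #include <stdio.h>

    /* Mean over the microengines that issue the reference (nonzero entries). */
    static double average(const double rate[], int n)
    {
        double sum = 0.0;
        int    active = 0;
        for (int i = 0; i < n; i++) {
            if (rate[i] > 0.0) { sum += rate[i]; active++; }
        }
        return active ? sum / active : 0.0;
    }

    int main(void)
    {
        /* csr references per cycle, 64-byte packets (Table C-5) */
        double csr[6]     = {0.013912, 0.012231, 0.012187,
                             0.0123682, 0.013649, 0.01371};
        /* fast_wr is issued only by the transmit microengines 4 and 5 */
        double fast_wr[6] = {0, 0, 0, 0, 0.017821, 0.017824};

        printf("csr average:     %f\n", average(csr, 6));     /* ~0.013010 */
        printf("fast_wr average: %f\n", average(fast_wr, 6)); /* ~0.017823 */
        return 0;
    }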


Appendix D: Latency

Figure D-1. Receive FIFO buffer Latency
[Cumulative percentage of receive FIFO buffer accesses versus latency in cycles (15 to 105), one curve each for Microengine0 through Microengine3.]

Figure D-2. Scratchpad RAM Latency
[Cumulative percentage of Scratchpad RAM accesses versus latency in cycles (0 to 100), one curve each for Microengine4 and Microengine5.]


Figure D-3. FBI CSR Latency
[Cumulative percentage of FBI CSR accesses versus latency in cycles (0 to 140), one curve each for Microengine0 through Microengine5.]

Figure D-4. Hash unit Latency
[Cumulative percentage of hash unit accesses versus latency in cycles (30 to 80), one curve each for Microengine0 through Microengine3.]


Table D-1. SDRAM Latency Data
[Per-cycle histogram of SDRAM reference latency for Microengine0 through Microengine3: the number of samples, the percentage of samples, and the cumulative percentage at each latency from 43 to 220 cycles.]
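The percentage and cumulative-percentage columns of Tables D-1 through D-6, and hence the curves of Figures D-1 through D-4, follow mechanically from the per-cycle sample counts. A minimal C sketch, assuming the histogram is held in an array indexed from the first observed latency:

    #include <stdio.h>

    /* Convert a latency histogram (samples per cycle count) into the
     * percentage and cumulative-percentage columns of Tables D-1 to D-6. */
    static void print_distribution(const long samples[], int n, int first_cycle)
    {
        long total = 0;
        for (int i = 0; i < n; i++) total += samples[i];
        if (total == 0) return;

        double cumulative = 0.0;
        for (int i = 0; i < n; i++) {
            double pct = 100.0 * samples[i] / total;
            cumulative += pct;
            printf("%4d %8ld %5.1f %6.1f\n",
                   first_cycle + i, samples[i], pct, cumulative);
        }
    }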


Table D-2. SRAM Latency (unlocked) Data
[Per-cycle histogram of unlocked SRAM reference latency for Microengine0 through Microengine5: the number of samples, the percentage of samples, and the cumulative percentage at each latency from 16 to 233 cycles; the receive microengines 0 through 3 first observe 16 cycles, the transmit microengines 4 and 5 first observe 18 cycles.]


Table D-3. SRAM Latency (locked) Data
[Per-cycle histogram of locked SRAM reference latency for Microengine0 through Microengine5: the number of samples, the percentage of samples, and the cumulative percentage at each latency from 20 to 251 cycles.]


Table D-4. Receive FIFO buffer Latency Data
[Per-cycle histogram of receive FIFO buffer latency for Microengine0 through Microengine3: the number of samples, the percentage of samples, and the cumulative percentage at each latency from 15 to 109 cycles.]


Table D-5. Scratchpad RAM Latency Data (1)

cycles #ofSmpl % Cumulative #ofSmpl % Cumulative9 379 16.5 16.5 358 15.6 15.6

10 415 18 34.5 448 19.5 3511 187 8.1 42.6 189 8.2 43.212 195 8.5 51.1 194 8.4 51.713 129 5.6 56.7 138 6 57.614 110 4.8 61.5 120 5.2 62.915 80 3.5 65 62 2.7 65.616 67 2.9 67.9 60 2.6 68.217 75 3.3 71.1 73 3.2 71.318 65 2.8 74 64 2.8 74.119 52 2.3 76.2 59 2.6 76.720 49 2.1 78.4 55 2.4 79.121 52 2.3 80.6 36 1.6 80.622 39 1.7 82.3 32 1.4 8223 42 1.8 84.1 45 2 8424 35 1.5 85.7 38 1.7 85.625 35 1.5 87.2 30 1.3 86.926 22 1 88.1 27 1.2 88.127 35 1.5 89.7 34 1.5 89.628 24 1 90.7 33 1.4 9129 26 1.1 91.8 22 1 9230 20 0.9 92.7 17 0.7 92.731 21 0.9 93.6 23 1 93.732 8 0.3 94 12 0.5 94.233 20 0.9 94.8 23 1 95.234 6 0.3 95.1 16 0.7 95.935 15 0.7 95.7 9 0.4 96.336 10 0.4 96.2 6 0.3 96.637 7 0.3 96.5 8 0.3 96.938 10 0.4 96.9 8 0.3 97.339 12 0.5 97.4 6 0.3 97.540 7 0.3 97.7 3 0.1 97.741 11 0.5 98.2 7 0.3 9842 5 0.2 98.4 4 0.2 98.143 6 0.3 98.7 4 0.2 98.344 2 0.1 98.8 5 0.2 98.545 4 0.2 99 8 0.3 98.946 8 0.3 99.3 4 0.2 9947 3 0.1 99.4 0 0 9948 0 0 99.4 4 0.2 99.249 1 0 99.5 2 0.1 99.350 0 0 99.5 1 0 99.351 2 0.1 99.6 1 0 99.452 1 0 99.6 2 0.1 99.553 1 0 99.7 0 0 99.554 0 0 99.7 2 0.1 99.655 1 0 99.7 1 0 99.656 0 0 99.7 0 0 99.657 1 0 99.7 0 0 99.658 0 0 99.7 0 0 99.659 0 0 99.7 1 0 99.760 0 0 99.7 0 0 99.761 2 0.1 99.8 1 0 99.7

Microengine4 Microengine5


Table D-5. Scratchpad RAM Latency Data (2)

[Table D-5, part 2: continuation of the Scratchpad RAM latency histogram for Microengine4 and Microengine5, covering latency values from 62 to 102 cycles.]


Table D-6. FBI CSR Latency Data (1)

[Table D-6, part 1: FBI CSR latency histogram for Microengine0 through Microengine5, giving the number of samples, percentage, and cumulative percentage at each latency value from 9 to 77 cycles.]


Table D-6. FBI CSR Latency Data (2)

[Table D-6, part 2: continuation of the FBI CSR latency histogram for Microengine0 through Microengine5, covering latency values from 78 to 143 cycles.]


Table D-7. Hash Latency Data

[Table D-7: Hash unit latency histogram for Microengine0 through Microengine3, giving the number of samples, percentage, and cumulative percentage at each latency value from 32 to 80 cycles.]


Appendix E: Multithreading Example

Figure E-1. Multithreading example


Appendix F: Theoretical Throughput Calculation for IP Packets

Table F-1. Theoretical Throughput of IP Packets

Media                  64-byte PPS          594-byte PPS         1518-byte PPS         Mixture (avg 406-byte) PPS
                       (46-byte IP packet)  (576-byte IP packet) (1500-byte IP packet) (avg 388-byte IP packet)

100 Mbps Ethernet      148,810              20,358               8,127                 29,343
Gigabit Ethernet       1,488,095            203,583              81,274                293,427
10 Gigabit Ethernet    14,880,952           2,035,831            812,744               2,934,272
OC-3 POS CRC-16        348,491              31,681               12,256                46,759
OC-12 POS CRC-16       1,412,830            128,439              49,688                189,570
OC-24 POS CRC-16       2,825,660            256,878              99,376                379,139
OC-48 POS CRC-16       5,651,321            513,756              198,752               758,278
OC-192 POS CRC-16      22,605,283           2,055,026            795,010               3,033,114
OC-3 POS CRC-32        335,818              31,573               12,240                46,524
OC-12 POS CRC-32       1,361,455            128,000              49,622                188,615
OC-24 POS CRC-32       2,722,909            256,000              99,245                377,229
OC-48 POS CRC-32       5,445,818            512,000              198,489               754,458
OC-192 POS CRC-32      21,783,273           2,048,000            793,956               3,017,834
ATM OC-3               174,245              26,807               10,890                38,721
ATM OC-12              706,415              108,679              44,151                156,981
ATM OC-24              1,412,830            217,358              88,302                313,962
ATM OC-48              2,825,660            434,717              176,604               627,925
ATM OC-192             11,302,642           1,738,868            706,415               2,511,698

Note: These throughput numbers are given for IP traffic so that the different encapsulations can be compared directly. The POS (Packet over SONET) performance calculations use CRC-16 and CRC-32.


I present the methodology for calculating PPS for each medium below.

IP over Ethernet

As described in Section 5.6, there are 38 bytes of protocol overhead per IP packet. The maximum theoretical throughput on Ethernet is calculated as follows.

Maximum Packets Per Second (PPS) = Ethernet Data Rate (bps) / {(18-byte Ethernet header and trailer + IP packet size + 12-byte IFG + 8-byte Preamble/SFD) x 8}
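To make the arithmetic concrete, the following is a minimal C sketch of the equation above. The function name and test values are mine, not part of the thesis; for a 46-byte IP packet on 100 Mbps Ethernet it reproduces the 148,810 pps figure in Table F-1.

#include <stdio.h>

/* Sketch: maximum packets per second for IP over Ethernet. The 38 bytes
 * of per-packet overhead are the 18-byte header/trailer, 12-byte IFG,
 * and 8-byte preamble/SFD from the equation above. */
static double ethernet_max_pps(double data_rate_bps, int ip_packet_bytes)
{
    const int overhead_bytes = 18 + 12 + 8;   /* 38 bytes per packet */
    return data_rate_bps / ((ip_packet_bytes + overhead_bytes) * 8.0);
}

int main(void)
{
    /* 46-byte IP packet on 100 Mbps Ethernet:
     * 100e6 / (84 x 8) = 148,810 pps, matching Table F-1. */
    printf("%.0f\n", ethernet_max_pps(100e6, 46));
    return 0;
}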

IP over SONET

First of all, we have to consider the pure data rate, including media and protocol overhead. An OC-1 (STS-1) frame consists of 9 rows by 90 columns of 8-bit bytes (9 x 90 x 8 = 6480 bits/frame). Frames are sent at a rate of 8000 frames per second (125-microsecond frame length). Therefore, the gross data rate (i.e., total bandwidth) of an OC-1 frame is 6480 bits x 8000 frames/sec = 51.84 Mbps. Next, an OC-3 (or STM-1) frame consists of 9 rows by 270 columns of 8-bit bytes (9 x 270 x 8 = 19440 bits/frame). The gross data rate is 19440 x 8000 frames/sec = 155.52 Mbps. An OC-12 (or STM-4) frame consists of 9 rows by 1080 columns of 8-bit bytes (9 x 1080 x 8 = 77760 bits/frame). The gross data rate therefore is 77760 x 8000 frames/sec = 622.080 Mbps. An OC-24 (or STM-8) frame consists of 9 rows by 2160 columns of 8-bit bytes (9 x 2160 x 8 = 155520 bits/frame). The gross data rate therefore is 155520 x 8000 frames/sec = 1244.160 Mbps. An OC-48 (or STM-16) frame consists of 9 rows by 4320 columns of 8-bit bytes (9 x 4320 x 8 = 311040 bits/frame). Hence, the gross data rate is 311040 x 8000 frames/sec = 2488.320 Mbps. Finally, an OC-192 (or STM-64) frame consists of 9 rows by 17280 columns of 8-bit bytes (9 x 17280 x 8 = 1244160 bits/frame). The gross data rate is 1244160 x 8000 frames/sec = 9953.280 Mbps.

Second, we have to calculate the SONET data rate, which takes the media overhead off. In OC-1, the first 3 columns contain the transport overhead, which includes the section overhead and the line overhead. The remaining 87 columns are called the synchronous payload envelope (SPE), which contains the path overhead and the payload. Path overhead is 1 column by 9 rows, leaving 86 columns for payload. As a result, the SONET data rate is given by the following equation.

SONET Data Rate = (90 col - 3 transport overhead col - 1 path overhead col) x 9 rows x 8 bits/byte x 8000 fps = 49.536 Mbps

Similarly, the SONET data rate for the other classes can be calculated as follows. Table F-2 presents the summarized gross data rates and SONET data rates.

Payload bps = (N x (90 col - 3 transport col) - 1 path col) x 9 rows x 8 bits/byte x 8000 fps, where N = OC-N
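The two rates can be sketched in C as follows. This is only an illustration of the equations above (the function names are mine); applying it to OC-1 reproduces the 51.84 Mbps gross and 49.536 Mbps payload figures derived above.

#include <stdio.h>

/* Sketch: gross rate of an OC-N frame, 9 rows x 90N columns x 8 bits,
 * sent at 8000 frames per second. */
static double sonet_gross_bps(int n)
{
    return 9.0 * (90 * n) * 8 * 8000;
}

/* Sketch: payload (SPE) rate per the equation above: N x (90 - 3)
 * columns minus 1 path-overhead column. */
static double sonet_payload_bps(int n)
{
    return (n * (90.0 - 3.0) - 1.0) * 9 * 8 * 8000;
}

int main(void)
{
    /* OC-1: 51.840 Mbps gross, 49.536 Mbps payload. */
    printf("%.3f %.3f\n", sonet_gross_bps(1) / 1e6, sonet_payload_bps(1) / 1e6);
    return 0;
}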


Table F-2. SONET and SDH Multiplex Rates

Optical Carrier   SDH STM Signal   Gross Data Rate (Mbps)   SONET Data Rate (Mbps)
                                                            (takes both the 3-column transport overhead and
                                                            the 1-column path overhead in the SPE into account)

OC-1              -                51.84                    49.536
OC-3              STM-1            155.52                   147.76
OC-12             STM-4            622.08                   599.04
OC-24             STM-8            1244.16                  1198.08
OC-48             STM-16           2488.32                  2396.16
OC-192            STM-64           9953.28                  9584.64

In addition, the protocol overhead encapsulating the data should be taken into account. In Table F-1, the performance is calculated based on CRC-16 and CRC-32. Hence, the Packet over SONET (POS) maximum PPS is calculated as follows.

CRC-16 header = 7 bytes = 1-byte delimiter + 4-byte HDLC + 2-byte CRC-16

CRC-32 header = 9 bytes = 1-byte delimiter + 4-byte HDLC + 4-byte CRC-32

POS max PPS = OC-N SONET Data Rate / {(IP packet size + CRC header bytes) x 8}
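A small C sketch of the POS calculation follows; the function name and test values are mine. For OC-12 with CRC-16 framing and a 46-byte IP packet it reproduces the 1,412,830 pps entry in Table F-1.

#include <stdio.h>

/* Sketch: maximum PPS for IP over POS, following the equation above.
 * CRC-16 framing adds 7 bytes per packet; CRC-32 adds 9 bytes. */
static double pos_max_pps(double sonet_rate_bps, int ip_bytes, int crc_hdr_bytes)
{
    return sonet_rate_bps / ((ip_bytes + crc_hdr_bytes) * 8.0);
}

int main(void)
{
    /* OC-12, CRC-16, 46-byte IP packet:
     * 599.04e6 / ((46 + 7) x 8) = 1,412,830 pps (Table F-1). */
    printf("%.0f\n", pos_max_pps(599.04e6, 46, 7));
    return 0;
}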

IP over ATM

The throughput of IP packet traffic over ATM is calculated as follows.

AAL5 PDU size = IP packet size + 8-byte SNAP + 4-byte AAL5 overhead + 4-byte CRC = IP packet size + 16 bytes

ATM cell count = roundup(AAL5 PDU size / 48 bytes)

Total cell bytes = cell count x 53-byte cell size

ATM max PPS = OC-N SONET Payload Data Rate / (Total cell bytes x 8)

Note: AAL: ATM Adaptation Layer, PDU: Protocol Data Unit, SNAP: Subnetwork Access Protocol
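The AAL5 segmentation arithmetic can likewise be sketched in C; the function name and test values are mine. For OC-12 and a 46-byte IP packet it reproduces the 706,415 pps entry in Table F-1.

#include <stdio.h>

/* Sketch: maximum PPS for IP over ATM, following the equations above.
 * The 16-byte constant is the 8-byte SNAP + 4-byte AAL5 overhead +
 * 4-byte CRC. */
static double atm_max_pps(double sonet_payload_bps, int ip_bytes)
{
    int pdu_bytes  = ip_bytes + 16;            /* AAL5 PDU size */
    int cells      = (pdu_bytes + 47) / 48;    /* roundup(PDU / 48) */
    int wire_bytes = cells * 53;               /* 53-byte ATM cells */
    return sonet_payload_bps / (wire_bytes * 8.0);
}

int main(void)
{
    /* OC-12, 46-byte IP packet: 62-byte PDU -> 2 cells -> 106 bytes;
     * 599.04e6 / (106 x 8) = 706,415 pps (Table F-1). */
    printf("%.0f\n", atm_max_pps(599.04e6, 46));
    return 0;
}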


Appendix G: Instruction Set of Other NPs

Table G-1. MIPS-1 Instruction Set (Integer only)

Instruction                      Description

Arithmetic and Logical Instructions
add dest, src1, src2             Add
addi dest, src1, imm             Add immediate
addu dest, src1, src2            Add unsigned
addiu dest, src1, imm            Add immediate unsigned
sub dest, src1, src2             Subtract
subu dest, src1, src2            Subtract unsigned
and dest, src1, src2             And
andi dest, src1, imm             And immediate
div src1, src2                   Divide
divu src1, src2                  Divide unsigned
mult src1, src2                  Multiply
multu src1, src2                 Multiply unsigned
or dest, src1, src2              Or
ori dest, src1, imm              Or immediate
nor dest, src1, src2             Nor
sll dest, src1, src2             Shift left logical
sllv dest, src1, src2            Shift left logical variable
sra dest, src1, src2             Shift right arithmetic
srav dest, src1, src2            Shift right arithmetic variable
srl dest, src1, src2             Shift right logical
srlv dest, src1, src2            Shift right logical variable
xor dest, src1, src2             Xor
xori dest, src1, imm             Xor immediate

Branch and Jump Instructions
beq src1, src2, offset           Branch on equal
bne src1, src2, offset           Branch on not equal
bgez src, offset                 Branch on greater than or equal to zero
bgezal src, offset               Branch on greater than or equal to zero and link
bgtz src, offset                 Branch on greater than zero
blez src, offset                 Branch on less than or equal to zero
bltz src, offset                 Branch on less than zero
bltzal src, offset               Branch on less than zero and link
j label                          Jump
jal label                        Jump and link
jalr src                         Jump and link register
jr src                           Jump register
rfe                              Return from exception

Comparison Instructions
slt dest, src1, src2             Set less than
slti dest, src1, imm             Set less than immediate
sltu dest, src1, src2            Set less than unsigned
sltiu dest, src1, imm            Set less than immediate unsigned

Load and Store Instructions
lb dest, imm(src)                Load byte
lbu dest, imm(src)               Load unsigned byte
lh dest, imm(src)                Load halfword
lhu dest, imm(src)               Load unsigned halfword
lw dest, imm(src)                Load word
lwl dest, imm(src)               Load word left
lwr dest, imm(src)               Load word right
sb src1, imm(src2)               Store byte
sh src1, imm(src2)               Store halfword
sw src1, imm(src2)               Store word
swl src1, imm(src2)              Store word left
swr src1, imm(src2)              Store word right

Constant-Manipulating Instructions
lui dest, imm                    Load upper immediate

Miscellaneous Instructions
mfhi dest                        Move from hi
mflo dest                        Move from lo
mthi dest                        Move to hi
mtlo dest                        Move to lo
mfcz dest                        Move from coprocessor z


Table G-2. PowerNP Picoprocessor Opcodes

Instruction                      Description

ALU Opcode / Arithmetic Immediate (AluOp)
add                              result = opr1 + opr2
add w/carry                      result = opr1 + opr2 + C
subtract                         result = opr1 - opr2
subtract w/carry                 result = opr1 - opr2 - C
xor                              result = opr1 XOR opr2
and                              result = opr1 AND opr2
or                               result = opr1 OR opr2
shift left logical               result = opr1 << opr2, fill with 0s
shift right logical              result = fill with 0, opr1 >> opr2
shift right arithmetic           result = fill with S, opr1 >> opr2
rotate right                     result = fill with opr1, opr1 >> opr2
compare                          opr1 - opr2
test                             opr1 AND opr2
not                              result = NOT(opr1)
transfer                         result = opr2

Logical Immediate Opcode (LOp)
xor                              result = opr1 XOR opr2
and                              result = opr1 AND opr2
or                               result = opr1 OR opr2
test                             opr1 AND opr2

Compare Immediate Opcode
Compare Immediate (1)            Compare odd GPR register with immediate data
Compare Immediate (2)            Compare even GPR register with immediate data
Compare Immediate (3)            Compare word GPR register with immediate data, zero extend
Compare Immediate (4)            Compare word GPR register with immediate data, sign extend

Load Immediate Opcode
Load immediate (1)               Load odd halfword GPR from immediate data
Load immediate (2)               Load even halfword GPR from immediate data
Load immediate (3)               Load word GPR from immediate data, zero extended
Load immediate (4)               Load word GPR from immediate data, 0 postpend
Load immediate (5)               Load word GPR from immediate data, 1 extended
Load immediate (6)               Load word GPR from immediate data, 1 postpend
Load immediate (7)               Load word GPR from immediate data, sign extended
Load immediate (8)               Load GPR byte 3 from low byte of immediate data
Load immediate (9)               Load GPR byte 2 from low byte of immediate data
Load immediate (10)              Load GPR byte 1 from low byte of immediate data
Load immediate (11)              Load GPR byte 0 from low byte of immediate data

Arithmetic/Logical Register Opcode
bit clear                        uses and (AluOp)
bit set                          uses or (AluOp)
bit flip                         uses xor (AluOp)

Count Leading Zeros Opcode
count leading zeros              returns the number of zeros from left to right until the first 1-bit is encountered

Control Opcodes
nop                              executes one cycle of time and doesn't change any state
exit                             terminates the current instruction stream.* The CLP will be put into an idle state and made available for a new dispatch
test and branch                  tests a single bit within a GPR register
branch and link                  performs a conditional branch,* adds one to the value of the current program counter, and places it onto the program stack
return                           performs a conditional branch* with the branch destination being the top of the program stack
branch register                  performs a conditional branch*
branch pc relative               performs a conditional branch*
branch reg+off                   performs a conditional branch*

Data Movement Opcodes
memory indirect                  transfers data between a GPR and a coprocessor array via a logical address in which the base offset into the array is contained in a GPR
memory add indirect              transfers data between a GPR and a coprocessor data entity (scalar or array) by mapping the coprocessor via a logical address onto the base address held in the GPR indicated by the opcode
memory direct                    transfers data between a GPR and a coprocessor array via a logical address that is specified in the immediate portion of the opcode
scalar access                    transfers data between a GPR and a scalar register via a logical address that consists of a coprocessor number and a scalar register
scalar immed                     writes immediate data to a scalar register via a logical address that is completely specified in the immediate portion of the opcode
transfer quadword                transfers quadword data from one array location to another using one instruction


zero array                       zeroes out a portion of an array with one instruction

Coprocessor Execution Opcodes
execute direct                   initiates a coprocessor command in which all of the operation arguments are passed immediately to the opcode
execute indirect                 initiates a coprocessor command in which the operation arguments are a combination of a GPR register and an immediate field
execute direct conditional       similar to the execute direct opcode except that it can be issued conditionally based on the cond field
execute indirect conditional     similar to the execute indirect opcode except that it can be issued conditionally based on the cond field
wait                             synchronizes one or more coprocessors
wait and branch                  synchronizes with one coprocessor and branch

Note: Conditional branch* depends on condition codes. Data Movement Opcodes support 23 options of direction, size, extension, and fill.

Table G-3. PowerNP Picoprocessor Condition Codes for conditional branch

Condition codes
0    equal or zero
1    not equal or not zero
2    carry set
3    unsigned higher
4    unsigned lower or equal
5    unsigned lower or equal
6    always
7    signed positive
8    signed negative
9    signed greater or equal
10   signed greater than
11   signed less than or equal
12   signed less than
13   overflow
14   no overflow


List of Figures

2-1. Internet Hierarchy 6

3-1. Router Processing on Fast Path 8

3-2. Pseudo Code of Receive Thread Main Loop 20

3-3. Pseudo Code of Transmit Scheduler Main Loop 23

3-4. Pseudo Code of Transmit Thread Main Loop 27

4-1. Architecture of the Intel IXP1200 31

4-2. Microengine Architecture 32

4-3. FBI Unit Architecture 35

4-4. Ready Bus and Ready Flags 36

4-5. Microengine Pipeline 37

4-6. Memory Access flow 39

4-7. Branch pipeline example with class3 instruction 41

4-8. Branch pipeline example with class2 instruction (case1) 42

4-9. Branch pipeline example with class2 instruction (case2) 43

4-10. Branch pipeline example with class1 instruction 43

4-11. Branch pipeline example with deferred branch instruction 44

4-12. Branch pipeline example with guess instruction 46

4-13. Branch pipeline example with guess and deferred branch options 46

5-1. Instruction Mix for Receiving Packets 50

5-2. Instruction Mix for Transmitting Packets 51

5-3. Instruction Mix for Overall Processing 52


5-4. SDRAM Latency 53

5-5. SRAM Latency (unlocked) 55

5-6. SRAM Latency (locked) 55

5-7. Executing, Aborted, Stalled, and Idle ratio on 64bytes Workload 57

5-8. Executing, Aborted, Stalled, and Idle ratio on 594bytes Workload 58

5-9. Executing, Aborted, Stalled, and Idle ratio on 1518bytes Workload 58

5-10. Executing, Aborted, Stalled, and Idle ratio on Mixture Workload 59

5-11. CPI for Microengines 60

5-12. Throughputs (bounded) 62

5-13. Throughputs (unbounded) 64

6-1. NetVortex Context Switch Mechanism 68

6-2. Coprocessor Execution Opcode Example (Wait Opcode) 71

A-1. Receive Ready Check 76

A-2. Receive Request Issue 77

A-3. Receive Packet Status Acquisition 78

A-4. Packet Buffer Allocation 79

A-5. Port Fail/Error Check 79

A-6. MAC Packet Header Acquisition 80

A-7. Parse Packet 81

A-8. Ethertype Field Classifier 82

A-9. Filter 82

A-10. Port information Acquisition for Filter 86

A-11. IP Header Acquisition 87


A-12. IP Version Check 87

A-13. IP Header Check & Modify 88

A-14. IP verify 89

A-15. IP Modify 90

A-16. Packet Discard 91

A-17. Trie Lookup 91

A-18. Next_Trie_Search for Trie Lookup 95

A-19. Write Modified IP and Ether Header 95

A-20. Transmit Assignment Read 96

A-21. Transmit Packet Link List Read 97

A-22. Transmit Packet Link List Update 97

A-23. Transmit Port Vector clear 98

A-24. Last Packet Transfer 98

A-25. Set Transmit Control Word 99

A-26. TFIFO Validate 99

A-27. Transmit Port Vector Modify 100

A-28. Transmit Packet Transfer 101

D-1. Receive FIFO buffer Latency 111

D-2. Scratchpad RAM Latency 111

D-3. FBI CSR Latency 112

D-4. Hash unit Latency 112

E-1. Multithreading example 131


List of Tables

3-1. Frequently occurring packets in the real Internet 12

3-2. Workloads of Fixed size packets 15

3-3. Workload of Internet Packets Mixture 15

4-1. Instructions Categorized by Class 40

4-2. Guess Branch Instructions 45

6-1. NetVortex extended Instruction set 67

6-2. C-5 Coprocessor Zero Register Definitions 70

B-1. Microengine Instruction Set 102

C-1. Instruction Mix Data for 64bytes packets 106

C-2. Instruction Mix Data for 594bytes packets 107

C-3. Instruction Mix Data for 1518bytes packets 108

C-4. Instruction Mix Data for Mixture packets 109

C-5. Memory Access per cycle 110

D-1. SDRAM Latency Data 113

D-2. SRAM Latency (unlocked) Data 117

D-3. SRAM Latency (locked) Data 120

D-4. Receive FIFO buffer Latency Data 124

D-5. Scratchpad RAM Latency Data 126

D-6. FBI CSR Latency Data 128

D-7. Hash Latency Data 130

F-1. Theoretical Throughput of IP Packets 132


F-2. SONET and SDH Multiplex Rates 135

G-1. MIPS-1 Instruction Set (Integer only) 137

G-2. PowerNP Picoprocessor Opcodes 138

G-3. PowerNP Picoprocessor Condition Codes for conditional branch 139

