Workload Characterization and Performance for a Network
Processor
Mitsuhiro Miyazaki
B.S., Osaka University (Japan), 1994
A Thesis Presented to the Faculty of
Princeton University
in Candidacy for the Degree of
Master of Science in Engineering
Professor Ruby B. Lee
Research Advisor
The Department of Electrical Engineering
Princeton University
June, 2002
Abstract
The explosive growth of the Internet and e-business requires faster deployment
of high-bandwidth equipment, greater flexibility to support emerging Internet
technologies, and new services within the network. The design of routers is being
changed significantly by the emergence of Network Processors (NPs). With
programmable NPs, exceptionally fast packet processing at high-bandwidth is
achieved through the optimization of both the instruction set and data path.
Routers have to perform complicated protocol stack processing with demands
for various services. However, the fast packet switching path, namely the packet
forwarding path with table look-up, filtering, queuing assignment and input/output
scheduling, influences network performance more than the slow packet switching
path. This paper characterizes router processing with pseudo code for the fast
path of an emerging network processor, the Intel IXP1200. It also addresses the
workload characterization of the fast path in routers. These characterizations
should be very useful in guiding the architectural design of future network
processors. They should also be beneficial for comparing the performance of
different solutions to fast path processing, using combinations of different
network processors, general-purpose processors, and hardware ASICs.
Network Processors (NPs) are generally designed for edge or backbone routers.
Therefore, NPs need to be adapted to a wide range of networks. For high-end
networks, NPs may be assigned to OC-192 (10Gbps). However, such rates are still
outside the reach of existing NP products. In reality, the primary target of
current NPs is up to OC-48 (2.49Gbps) wire-speed, which delivers minimum-sized
packets at 5.65 million packets per second (pps). In this paper, I
evaluate IXP1200 from the computer architect’s point of view, rather than the
network infrastructure point of view. First of all, this paper presents the
Instruction Mix (i.e., the distribution of executed instructions) of the
Microengine on the fast packet switching path, on the basis of simulation
results of a reference program provided by Intel, and clarifies the types of
instructions most important for NPs from the perspective of instruction set
architecture design. Next, it shows the latencies in accessing external and
internal resources such as SDRAM, SRAM, and the receive FIFO buffer. Since the
IXP1200 can hide such latencies with context swap instructions, their benefit
is readily apparent: memory accesses generally take many cycles and degrade CPU
efficiency. In addition, it evaluates CPI (Cycles per Instruction) and the
ratio of executing, aborted, stalled, and idle cycles, which indicates the
efficiency of multithreading and fast context swapping in the IXP1200. Finally,
this paper presents the throughput the IXP1200 can achieve at OC-48, and
compares the IXP1200 with other well-known NPs with respect to context switch
and branch mechanisms.
Acknowledgments
I am indebted to my advisor, Ruby Lee, for the valuable feedback and help
throughout my entire thesis. I will always admire Ruby’s vision and inspiration, as
well as her kind-hearted personality.
Paul Huang, Abdulla F. Bushait, and Robert Miller have always been helpful to
me and have certainly made my life easier and more fun at Princeton. Thanks to
Aaron Moore, a close friend at Princeton with whom I have talked about many
different subjects. Thanks to Shiro and Kuriko Okita, and Hidechika and Emi
Koizumi, for engaging me in many interesting discussions and experiences. I am
also very grateful to Lidija Lukic for providing me with English advice and
wonderful knowledge. Thanks to Richard G. Knight for giving me precious and
enjoyable information outside my field.
Special thanks to my parents Yukiko and Shigeto Miyazaki for their love, and
the desire they have instilled in me to learn and excel. Thanks to my sisters Chie
and Shiho for cheering me up. I’m also very grateful to my wife’s parents Yoshiko
and Yukio Kaneko for support and encouragement to finish this thesis.
Most of all, I thank my wonderful wife Atsuko, who has helped me and been
patient throughout this entire process and deserves much credit for this thesis.
She has been very supportive, and I truly appreciate all that she has done for
me while working full time and providing a warm and loving family environment.
Contents
1 Introduction
2 Network Configuration and Market for Network Processors
3 Router Processing and Workload Characterization
3.1 Router Processing
3.2 Workload Characterization and Proposal
3.3 Pseudo Code of Router Processing
3.3.1 Receive Packet Processing
3.3.2 Transmit Packet Processing
4 Network Processor Architecture
4.1 Microengine Architecture
4.2 FBI Unit Architecture and IX Bus Interface
4.3 Microengine Pipelining
4.4 Memory Access
4.5 Branch and Context Switch Mechanism
4.5.1 Class3 Instructions
4.5.2 Class2 Instructions
4.5.3 Class1 Instructions
4.5.4 Solutions for Branch Penalties
5 IXP1200 Network Processor Evaluation
5.1 Methodology
5.2 Instruction Mix
5.3 Latency
5.4 Execution, Aborted, Stalled, and Idle Ratio
5.5 CPI (Cycles per Instruction)
5.6 Throughput
6 Other Network Processors
6.1 Lexra's NetVortex
6.2 Motorola's C-5
6.3 IBM's PowerNP
7 Conclusions and Future Work
8 Bibliography
Appendix A Pseudo Code
Appendix B Microengine Instruction Set
Appendix C Instruction Mix Data
Appendix D Latency
Appendix E Multithreading Example
Appendix F Theoretical Throughput Calculation for IP Packets
Appendix G Instruction Set of Other NPs
List of Figures
List of Tables
1. Introduction
Network bandwidth has been a critical resource in the Internet in recent years.
Routers perform key functions to accommodate increasing traffic from users.
Until the late 1990s, an edge device employed high-performance general-purpose
CPUs to perform tasks such as header processing, forwarding, table lookups,
access control, and implementing the network stack. As another approach, ASICs
were considered because they can perform tasks at wire-speed rates. However,
general-purpose CPUs and ASICs have problems with performance and flexibility,
respectively. Furthermore, agile delivery of products is also required to
provide sophisticated communication infrastructures within very limited
time-to-market frames. Network processors (NPs) are now widely expected to fill
the need that CPUs and ASICs fail to meet. NPs are programmable engines that
are optimized to perform wire-speed communication. They have made it possible
to significantly improve the performance and flexibility of routers, and even
the agility of delivery.
Routers are generally deployed at network edges and in backbones. Therefore,
NPs must be adapted to a variety of speeds in the Internet. For high-end
networks, NPs could be assigned to OC-192 (10Gbps), which delivers at most 22.6
million minimum-sized packets per second (pps). (Note: a minimum-sized packet
is defined as a 64-byte packet in this paper.) However, generally speaking,
multiple NPs would be required to achieve such a high speed. In reality, the
main target of current NPs such as the Intel IXP1200 [1], the Vitesse IQ2000
[2], the Motorola C-5 [3], and the IBM PowerNP [4] is OC-48 (2.49Gbps)
wire-speed, which delivers minimum-sized packets at 5.65Mpps; alternatively, it
could be OC-24 (1.24Gbps), OC-12 (622Mbps), or OC-3 (155Mbps).
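As a rough cross-check of the quoted rates (the thesis's own accounting is in
Appendix F), the wire-speed packet rate is simply the line rate divided by the
number of bits each packet occupies on the wire. A minimal sketch, where the
55-byte on-wire size (a 40-byte minimum IP packet plus roughly 15 bytes of
framing overhead) is my illustrative assumption, not the thesis's exact
accounting:

```python
def packets_per_second(line_rate_bps: float, packet_bits: float) -> float:
    """Wire-speed packet rate for back-to-back packets of a given on-wire size."""
    return line_rate_bps / packet_bits

# Assumption for illustration: ~55 bytes (440 bits) per minimum-sized packet
# on the wire.
oc48_pps = packets_per_second(2.488e9, 55 * 8)    # ~5.65 million pps
oc192_pps = packets_per_second(9.953e9, 55 * 8)   # ~22.6 million pps
```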
Routing is an inherently parallel task because each packet that traverses a
network is a self-contained unit with its own destination header and data
payload. Routers have to process each packet independently and possibly out of
order, supervise myriads of packets in parallel, and control this huge volume
of traffic. NPs are commonly composed of multiple processors and can run
multiple threads simultaneously, which maps naturally onto IP routing, data
forwarding, and other header processing. For example, Intel's IXP1200 includes
six independent RISC processors called Microengines, each supporting four
contexts with hardware multithreading. As a result, the IXP1200 can manage 24
completely independent threads, execute the data-intensive tasks of steering
packets toward their destinations in parallel, and hide the long latencies of
off-chip memory references by rapidly switching contexts among threads.
Routers fundamentally have to perform complicated protocol stack processing to
meet demands for various services and speeds. This paper first describes the
Internet hierarchy and clarifies the target markets of NPs in Section 2. Then,
it characterizes router processing and workload, not only for router system
analysis but also for NP analysis, in Section 3. Since the performance
requirements of NPs should be defined on the basis of a realistic workload,
four workload models are proposed for NP simulation based on real Internet
packet data. The four workload models consist of three fixed-size packet
workloads (64 bytes, 594 bytes, and 1518 bytes) and one mixture-packet workload
combining those three packet sizes. Section 4 then introduces the architecture
of the Microengine and the Fast Bus Interface (FBI) unit, which is responsible
for transferring packets to and from external Media Access Control (MAC) layer
devices. In particular, it presents Microengine pipelining, memory access, and
the branch and context switch mechanisms.
This paper presents experimental results and an evaluation of the Microengine.
Section 5 describes the evaluation methodology, and then presents five
simulation results for the four proposed workloads on the Microengines: 1)
instruction mix (i.e., the distribution of executed instructions), 2) memory
access latency, 3) the ratio of executing, aborted, stalled, and idle cycles,
4) CPI (Cycles per Instruction), and 5) throughput on the fast switching path.
For the instruction mix, the Microengine instruction set is categorized into
five classes and analyzed for workload dependence. In addition, the advantage
of context switching is demonstrated by the measured memory access latencies
and the ratio of stalled cycles. CPI shows the architectural limitations of a
Microengine and its dependence on workload. The throughput results show what
rate a single IXP1200 can achieve for wire-speed communication. Finally,
Section 6 introduces other well-known NPs and compares them with the IXP1200
with respect to context switch and branch mechanisms, and Section 7 concludes
with the research results.
2. Network Configuration and Market for Network Processors
In response to the rapid and extensive growth of Internet traffic, Internet
service providers (ISPs) are experiencing constant demands for expanded services
and network features. Backbone operators are also involved in the demands for
high-speed switching and routing. The Internet's explosive growth is driving
requirements for higher quality, faster connectivity, and more software features for
an ever-growing number of customers. Routers would be deployed at various places
of the Internet. In this section, I present a theoretical image of the Internet, and
clarify the target markets of Network Processors (NPs) in the Internet.
Figure 2-1 depicts the Internet hierarchy divided into five levels. The first level
is the Network Access Points (NAPs) where major Internet backbone operators,
called Network Service Providers (NSPs), interconnect to establish the core concept
of an Internet. NSPs also interconnect at Metropolitan Area Exchanges (MAEs).
Since MAEs serve the same purpose as the NAPs and are privately owned, they are
not shown in Figure 2-1. The second level is the national backbone operators,
sometimes referred to as National Service Providers (NSPs), and the network of
networks spreads out from there. Some of the large NSPs are UUNet, IBM, BBN
Planet, SprintNet, PSINet, etc. The third level of the Internet is made up of regional
networks and the companies that operate regional backbones. Typically, they
operate backbones within a state or among several adjoining states, much like
the NSPs. They typically connect to an NSP, or increasingly to several NSPs, to be on the
Internet. Some have a connection to a single NAP, and then they extend the
network to smaller cities and towns in their areas. In general, levels 1, 2,
and 3 together can be called the Core. The Core can be thought of as a huge
network mixing Synchronous Optical Network (SONET)/Synchronous Digital
Hierarchy (SDH), Frame Relay, and Asynchronous Transfer Mode (ATM), consisting
of Core routers and switches that support a variety of high-speed links:
OC-192/STM64 (10Gbps), OC-48/STM16 (2.49Gbps), OC-24/STM8 (1.24Gbps),
OC-12/STM4 (622Mbps), OC-3/STM1 (155Mbps), T3/DS3 (45Mbps), T1/E1
(1.5Mbps/2Mbps), and so on. Some backbone maps can be found at [5], [6], and
[7].
The fourth level of the Internet is the individual Internet Service Providers
(ISPs). They lease connections from an NSP or a regional network operator. An
ISP network usually consists of a number of POPs (Points of Presence). A POP is
a physical location where a set of Edge and Core routers is located. Therefore,
even though level 4 can basically be called the Edge, part of level 4 could be
recognized as part of the Core. Edge routers generally provide individual
subscribers with access to the Core network, and are also required to support
various speed ranges, such as OC-24/STM8, OC-12/STM4, OC-3/STM1, T3/DS3, and
T1/E1. The fifth level of the Internet is the consumer and business market, and
basically includes Access routers, which connect a customer to an ISP's POP,
and Customer routers, which connect to the end points of the Internet. The
required speed of those routers is much lower than that of Edge and Core
routers. Since large enterprises sometimes operate much like ISPs, they may
also have a Core router and an Edge router to connect to branch offices.
In fact, the main target applications of NPs are Edge routers and Core
routers/switches. These routers require more processing capability and
flexibility than Access and Customer routers. Therefore, NPs clearly need to
achieve wire-speed at optical network rates while providing programmability,
flexibility, and scalability.
Note: Level1-3: Core, Level4: Edge, and Level5: Consumer and Business
Figure 2-1. Internet Hierarchy
3. Router Processing and Workload Characterization
3.1 Router Processing
Routers are the most common network layer devices in the Open Systems
Interconnection (OSI) seven-layer model. A router is connected to at least two
networks and decides which way to send each packet based on its current
understanding of the state of the networks it is connected to. Routers create
and maintain a table of the available routes and their conditions, and use this
information along with distance and cost algorithms to determine the best route
for a given packet. Typically, a packet travels through a number of network
points with routers before arriving at its destination. Routers actually
support a variety of functions in addition to IP routing. In practice, a
router's functions depend on the specifications of individual vendors. However,
the fundamental functions covered by most routers can be generalized. To assess
NP processing performance, we should focus especially on fast path processing
and characterize the router processing that is in most cases executed by
software. A good reference on router components is [8]. This section
characterizes fundamental router processing based on it.
Figure 3-1. Router Processing on Fast Path
Figure 3-1 depicts router processing on the fast path (i.e., the forwarding
path). First, the Input Scheduler (IS) manages input port sharing and gets a
packet from an input port. The received packet may first be placed into a
receive FIFO (RFIFO). In reality, this packet passes through a Physical layer
device (PHY) and a Media Access Control (MAC) device, with framing and error
detection at the data link layer, before reaching the IS. After that, the
packet is parsed, and the Classifier (CF) chooses an appropriate Receive Packet
Buffer (RPB) assigned to a Forwarder (FW) based on certain fields in the packet
header. In general, different FWs can be applied to incoming packets according
to protocol, service type, priority, flow control, and so on. Most data link
protocols have some sort of protocol identifier field that can be used to
select the FW on a specific interface. For example, the type field in Ethernet
and the Logical Link Control (LLC)/Subnetwork Access Protocol (SNAP) header
defined in IEEE 802.2 are frequently used for identification, and not only for
LAN protocols. In addition, the IP option and Type of Service (TOS) fields of
IPv4, and the priority and flow label fields of IPv6, can be applied for
classification. The classification decision is generally made on fields
associated with OSI Layers 2 through 4.
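As a minimal illustration of this dispatch, the sketch below selects a
processing path from the Ethernet type/length field. The EtherType constants
are standard values, but the path names are hypothetical labels for
illustration only:

```python
# Minimal classifier sketch: dispatch on the Ethernet type/length field.
# Values below 0x0600 are 802.3 lengths rather than types (IEEE 802.3).
ETHERTYPE_IP = 0x0800
ETHERTYPE_ARP = 0x0806

def classify(type_or_length: int) -> str:
    if type_or_length < 0x0600:
        return "llc_snap_path"   # IEEE 802.3 frame: parse LLC/SNAP next
    if type_or_length == ETHERTYPE_IP:
        return "ip_forwarder"
    if type_or_length == ETHERTYPE_ARP:
        return "arp_handler"
    return "core_stack_queue"    # unknown types go to the slow path
```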
Routers typically provide access control mechanisms for permitting or denying
the flow of packets. Since the router parses a packet during classification, it
can gather the information needed by the Filter (FL) at the same time, perform
the FL operation, and discard irrelevant packets before they are forwarded.
There are various filtering operations at OSI Layers 2 and 3. A Layer 2 FL can
permit or deny forwarding based on a MAC source and/or destination address,
protocol type, Ethernet vendor code, or LLC information. Typical Layer 3 FL
parameters include Layer 3 source and/or destination addresses, either
explicitly or after a wildcard mask is applied. Other parameters include the IP
protocol type, TOS/IP precedence bits, and TCP and UDP port values. The latter
parameters are actually Layer 4 information, but are commonly specified in a
Layer 3 context.
In fact, this sort of FL can be placed before FW, after FW, or both. An FL
positioned before FW is called an Inbound Filter, and one positioned after FW
an Outbound Filter. An Inbound FL action is applied to all incoming packets,
while an Outbound FL is applied only to specified packets. Although Outbound FL
could be considered more efficient than Inbound FL, the choice depends on the
router maker. In Figure 3-1, the FL is an Inbound Filter.
The Forwarder (FW) picks packets out of the Receive Packet Buffer (RPB). In
general, the FW manipulates the TTL and checksum fields of the IP header,
performs an IP lookup in the forwarding table, modifies the data link header
and IP header, and delivers the packet toward the output ports. Routers
commonly have two key data structures for route lookup: the Routing Information
Base (RIB) and the Forwarding Information Base (FIB). The RIB is optimized for
updates by dynamic routing protocols such as the Routing Information Protocol
(RIP), Interior Gateway Routing Protocol (IGRP), Enhanced Interior Gateway
Routing Protocol (EIGRP), Open Shortest Path First (OSPF), and Border Gateway
Protocol (BGP). The FIB, on the other hand, is optimized for high-speed lookup
and packet forwarding. The RIB is not illustrated in Figure 3-1 because it is
not part of the fast path. The FIB is expected to use efficient data structures
and algorithms, and occasionally hardware-assisted lookup, for rapid
forwarding. The fundamental data structure of the FIB is some sort of hash
table or tree lookup table containing forwarding information such as the prefix
and next hop.
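The tree-lookup idea can be sketched as longest-prefix matching over
(prefix, next hop) entries. A real FIB would use a multibit trie or hardware
assist rather than the linear scan below, and the routes and gateway names are
made-up examples:

```python
import ipaddress

# Toy FIB: longest-prefix match over (network, next hop) entries.
# Routes and next-hop names are made-up examples for illustration.
FIB = [
    (ipaddress.ip_network("0.0.0.0/0"),   "gw-default"),
    (ipaddress.ip_network("10.0.0.0/8"),  "gw-a"),
    (ipaddress.ip_network("10.1.0.0/16"), "gw-b"),
]

def lookup(dst: str) -> str:
    """Return the next hop of the longest matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, hop) for net, hop in FIB if addr in net]
    return max(matches)[1]   # longest prefix wins
```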
After forwarding, Queuing Assignment (QA) puts outbound packets into the
Transmit Queuing Buffer (TQB) corresponding to an output port. Routers must
implement some queuing discipline that governs how packets are buffered while
waiting to be transmitted. A queuing algorithm is composed of a scheduling
discipline and a drop policy. The simplest queuing algorithm is First-In
First-Out (FIFO) queuing with a tail drop policy. Tail drop means that packets
arriving at the end of the FIFO are dropped if the FIFO is full. A simple
variation of FIFO queuing is priority queuing. In this case, the router
implements two kinds of FIFO queues: a priority queue and a non-priority queue.
For example, the priority could depend on the Type of Service (TOS) field in
the IP header. Additionally, the Fair Queuing (FQ) algorithm maintains a
separate queue for each flow currently being handled by the router, and the
router serves these queues in a round-robin manner.
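A minimal sketch of FIFO queuing with tail drop, plus the two-queue priority
variant; the queue depth is an arbitrary choice for illustration:

```python
from collections import deque

CAPACITY = 4  # arbitrary queue depth for illustration

def enqueue_tail_drop(q: deque, pkt) -> bool:
    """FIFO with tail drop: reject the packet when the queue is full."""
    if len(q) >= CAPACITY:
        return False          # tail drop
    q.append(pkt)
    return True

def dequeue_priority(prio_q: deque, normal_q: deque):
    """Two-queue priority scheduling: always drain the priority queue first."""
    if prio_q:
        return prio_q.popleft()
    if normal_q:
        return normal_q.popleft()
    return None
```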
In reality, a router may use more sophisticated strategies such as Weighted
Fair Queuing (WFQ). WFQ assigns a weight to each flow, that is, to each queue.
This weight logically specifies how many bits to transmit each time the router
services that queue, which effectively controls the percentage of the link's
bandwidth that the flow will get. It can also be applied to classes of traffic,
such as the TOS field in the IP header.
The Output Scheduler (OS) selects a non-empty TQB, transfers a packet into the
Transmit FIFO (TFIFO), and then sends it to the associated output port. With
FQ, the output scheduler checks the TQBs in round-robin manner and delivers
packets. The scheduler generally performs no processing on the packet itself.
However, when multiple paths to the same destination exist, Load Balancing (LB)
can be employed in the output scheduler. LB optimizes bandwidth utilization and
improves recovery time after link or interface failures. LB can be based on
round-robin, per-packet, per-destination, or source-destination hash schemes,
among others.
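The weight-controls-bandwidth idea behind WFQ can be sketched as weighted
round-robin over byte credits; the weights and packet sizes below are arbitrary
illustrative values:

```python
from collections import deque

def weighted_round_robin(queues, weights, rounds):
    """Serve each queue up to `weight` bytes per round (a simple WFQ
    approximation). `queues` maps name -> deque of packet sizes in bytes."""
    sent = {name: 0 for name in queues}
    for _ in range(rounds):
        for name, q in queues.items():
            credit = weights[name]
            while q and q[0] <= credit:
                credit -= q[0]
                sent[name] += q.popleft()
    return sent
```

With weights in a 3:1 ratio and equal offered load, the two queues receive
link bandwidth in roughly the same 3:1 ratio.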
3.2 Workload Characterization and Proposal
Generally speaking, real Internet traffic includes various sizes and types of
packets, which generate network load and affect a router's processing ability.
Therefore, when testing routers, it is very important to consider what kinds of
packets occur most frequently in real Internet streams.
In this section, I characterize workloads for NP evaluation and propose four
workloads for testing the Microengine, based on real Internet measurement data
collected by the Measurement and Network Analysis Group of the National
Laboratory for Applied Network Research (NLANR) project, located at the San
Diego Supercomputer Center [10]. The reference data were gathered during
February 2001 under National Science Foundation Cooperative Agreement
No. ANI-9807479 and NLANR. In addition, these data are used by router tester
makers such as Agilent Technologies [11]. The NLANR Measurement and Network
Analysis Group monitors real Internet packets and records them every day. The
raw data can be found on the NLANR web site at [12].
Briefly summarized, a total of 342 million packets were sampled and recorded at
the network monitor site during this period. The average packet size was 402.7
bytes, with the packet sizes and types in Table 3-1 occurring most frequently.
Table 3-1. Most frequently occurring packets in the real Internet

Packet Size | Packet Type Description | Packet Distribution | Internet Traffic
1) 40 Bytes | TCP packets with an IP header but no payload (i.e., only a 20-byte IP header plus a 20-byte TCP header), typically sent at the start of a new TCP session. | 35% | 3.5%
2) 576 Bytes | The default IP Maximum Datagram Size (MDS) packets without fragmentation, including the default TCP Maximum Segment Size (MSS) 536-byte packets. | 11.5% | 16.5%
3) 1500 Bytes | Packets corresponding to the Maximum Transmission Unit (MTU) size of an Ethernet connection. | 10% | 37%
40-byte packets are generally used for the three-way handshake of TCP
connection establishment or termination. These packets are delivered very often
in the Internet and are expected to impose a heavy CPU load on routers.
However, since these packets are small, they represent only 3.5% of the
Internet traffic.
An IP packet can logically be up to 65,535 bytes in length. However, there is a
long-established rule in RFC 879 [13]: hosts must not send datagrams larger
than 576 bytes unless they have specific knowledge that the destination host is
prepared to accept larger datagrams. As a result, the default IP Maximum
Datagram Size is 576 bytes, which consists of the IP header (20 bytes), the TCP
header (20 bytes), and the TCP Maximum Segment Size (MSS) (536 bytes). Although
these packets occur less frequently than 40-byte packets, they contribute more
to the Internet traffic load because of their size.
Ethernet is a very popular packet format handled by routers. From the results
in Table 3-1, it turns out that packets corresponding to the Maximum
Transmission Unit (MTU) size of an Ethernet connection occupy a considerable
share of Internet traffic because of their size. Several other packet sizes
occurred more frequently than normal, where normal is defined as more than 0.5%
of all packets: for example, 52, 1420, 44, 48, 60, 628, 552, 56, and 1408
bytes.
Four sorts of packet workloads are proposed in this paper. First, I propose
three workloads of fixed-size packet streams, shown in Table 3-2. Each workload
is formatted as Ethernet packets based on the three sizes of the most
frequently occurring packets in the real Internet. In other words, these
workloads represent the packets of Table 3-1 encapsulated with an Ethernet
header and trailer. The 64-byte, 594-byte, and 1518-byte Ethernet packet
workloads are provided for Microengine simulation on the assumption that an
IXP1200-based router has 16 x 100Mbps Ethernet ports. A 64-byte Ethernet packet
actually includes 6 bytes of padding in addition to the 14-byte Ethernet
header, 20-byte IP header, 20-byte TCP header, and 4-byte Ethernet trailer,
because the minimum Ethernet frame length is defined as 64 bytes.
In addition, I propose a simple mixture of the three packet sizes as the fourth
workload. Some router manufacturers commonly use such a mixture as a "quick and
dirty" approximation of the Internet packet mixture. Table 3-3 shows the
proposed mixture ratio of packet sizes. I approximated the traffic loads of the
64-byte and 1518-byte packets to the values for the 40-byte and 1500-byte
packets shown in Table 3-1. The 594-byte packets are regarded as representative
of the other packet sizes between 64 and 1518 bytes. The mixture has an average
packet size of 406 bytes. If we assume that the packets of Table 3-1 are simply
formatted as Ethernet packets, the average packet size is 420.7 bytes.
Therefore, we can expect the proposed workload to correlate very closely with
realistic Internet traffic (correlation value: 0.965).
Table 3-2. Workloads of fixed-size packets

Packet Size | Packet Type Description
1) 64 Bytes | The minimum-size Ethernet packet, consisting of a 14-byte Ethernet header, 20-byte IP header, 26-byte payload, and 4-byte Ethernet trailer (FCS); expected to be used for the TCP handshake
2) 594 Bytes | An Ethernet packet including a 14-byte Ethernet header, 20-byte IP header, 556-byte payload (assuming a 20-byte TCP header plus 536-byte MSS), and 4-byte Ethernet trailer (FCS)
3) 1518 Bytes | The maximum-size Ethernet packet, consisting of a 14-byte Ethernet header, 20-byte IP header, 1480-byte payload, and 4-byte Ethernet trailer (FCS)
Table 3-3. Workload of Internet packets mixture

Packet Size (Bytes) | Proportion of Total | Traffic Load
64 | 50% (6 parts) | 7.881%
594 | 41.7% (5 parts) | 60.96%
1518 | 8.3% (1 part) | 31.158%
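The average size and traffic-load shares in Table 3-3 follow directly from the
6:5:1 parts ratio; a quick arithmetic check:

```python
# Verify Table 3-3: a 6:5:1 mixture of 64-, 594-, and 1518-byte packets.
parts = {64: 6, 594: 5, 1518: 1}
total_pkts = sum(parts.values())                       # 12 packets per cycle
total_bytes = sum(size * n for size, n in parts.items())

avg_size = total_bytes / total_pkts                    # 4872 / 12 = 406 bytes
load = {size: size * n / total_bytes for size, n in parts.items()}
# load[64] ~ 7.881%, load[594] ~ 60.96%, load[1518] ~ 31.158%
```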
3.3 Pseudo Code of Router Processing
This section presents pseudo code for the loop programs executed by each
Microengine in this research's simulation, and makes concrete the router
processing explained in Figure 3-1. Although the IXP1200 has six Microengines
and allows a total of twenty-four threads to run on them, in the simulation
four Microengines are dedicated to receive processing and two Microengines to
transmit processing.
Receive processing is defined as the interval from when a packet arrives at an
input port until the packet is enqueued into the Transmit Queuing Buffer (TQB).
Transmit processing is the interval from when the packet is dequeued until the
packet is sent to an output port. The transmit functions are split across two
Microengines; each transmit Microengine runs a scheduler thread and three
transmit threads.
Figures 3-2 through 3-4 present pseudo code for receive and transmit
processing. The code is composed of a variety of function codes; in Appendix A,
some substantial segments of pseudo code are presented for reference. The
pseudo code here is slightly simplified, and context switch descriptions are
omitted, for easier comprehension of the router processing.
In the code, an italicized word denotes a register. With no prefix, it
represents a general-purpose register (GPR) with context-relative addressing,
which belongs to a single thread and cannot be read or written by other threads
in a Microengine. A "$" prefix denotes a context-relative SRAM transfer
register; similarly, "$$" denotes a context-relative SDRAM transfer register.
If "@" is placed in front of a name, the register is an absolute register,
which can be read or written by any of the four threads executing in a
Microengine, as distinguished from the context-relative registers. In addition,
a thread can access control status registers (CSRs) in the Fast Bus Interface
(FBI) unit for packet processing; the CSR_ prefix denotes a CSR. The data
terminology of the IXP1200 is: quadword: 64 bits, longword: 32 bits, word: 16
bits, and byte: 8 bits. (Note: the Microengine and FBI architectures are
described in Section 4.)
3.3.1 Receive Packet Processing
Figure 3-2 shows pseudo code for the receive processing main loop. This code is
assigned to each of 16 threads, and each thread is bound to a specific port
number and receive FIFO (RFIFO) element number. Since the IXP1200 has 16 ports
and the receive FIFO has 16 elements (each element holds 64 bytes of incoming
packet data plus 16 bytes of extended data and status), receive threads 0
through 15 are assigned one-to-one to port numbers 0 through 15 and RFIFO
element numbers 0 through 15.
First, the Input Scheduler checks the receive ready flags, which indicate that
a packet is ready in an external Media Access Control (MAC) device, by reading
the REC_RDY register (Note: ready flags are elaborated in Section 4.2). It then
issues a receive request to the FBI by writing rec_req into the REC_REQ
register (Note: rec_req should be prepared during initialization). Each
function uses a semaphore so that only one receive thread reads the receive
ready flags, and only one receive thread posts a receive request, at a time.
Once the FBI starts to receive a packet from the MAC device, a start_receive
signal is asserted to the receive thread, informing it that the packet data is
in the RFIFO element. Then receive_status reads the control information from
the RCV_CTL register and sets the necessary information into the rec_state and
exception registers. RCV_CTL contains start-of-packet (SOP) and end-of-packet
(EOP) assertions and error indications from the MAC. The receive thread then
allocates a packet descriptor and buffer with an SRAM pop operation. This
allocation requires parameters such as the packet buffer base address
(PKBUF_BASE), buffer size (PKBUF_SIZE), descriptor base address (DESC_BASE),
and descriptor size (DESC_SIZE). In addition, the thread checks whether the
receive port has a failure or error, based on the content of the exception
register, and increments the exception counter.
From the result of receive_status, if the rec_state register contains the SOP bit, the thread reads the MAC packet header from the RFIFO into the SRAM transfer register named $pkt_buf and extracts the 2-byte protocol/length field. parse_packet then classifies the packet into one of three link types, 1) Ethernet, 2) 802.3 with LLC, or 3) 802.3 with LLC/SNAP, based on the protocol/length field, and extracts the ethertype indicating the upper-layer packet type, such as IP or ARP. The pkstate register holds the packet status: the link type and the packet discard decision. A router typically has multiple, different kinds of forwarders corresponding to different services, priorities, and protocols. Although this loop program uses the ethertype only for filtering and handles just one type of forwarder, the ethertype could in general be used to select among different forwarders. Example pseudo code of a classifier for different forwarders appears in Appendix A.
Once the thread finishes classifying the packet, it passes the ethertype and all header information to etherfilter, which filters the packet based on the port configuration. The thread discards the packet if the pkaction register says so; otherwise, the forwarder process is invoked.
In the forwarder, get_IP_header first transfers the IP header into $pkt_buf_IP, and the thread extracts the IP version field, whose position depends on the link type. If the packet is IP version 4 without options, the thread directly transfers the remaining payload from the RFIFO into the buffer, then checks the total length, TTL (Time to Live), and checksum in the IP header, and outputs the result into the exception register. After that, it decrements the TTL by 1 and updates the checksum to reflect the TTL change. These functions are packed into xferpayload_&_iphdrchck_&_modify. If the packet is not version 4 or has options, it is still transferred to the buffer but is then enqueued on the core stack interface queue so that the StrongARM can process it; the thread therefore sets the core stack interface bit in output_intf, which represents the output queue interface.
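Because only the TTL word changes, the checksum adjustment does not require recomputing the checksum over the whole header; the incremental-update identity of RFC 1624 suffices. A sketch in Python (illustrative, not the IXP1200 microcode; the function names are my own):

```python
def ipv4_checksum(header: bytes) -> int:
    """Full one's-complement checksum over an IP header (checksum field zeroed)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(int.from_bytes(header[i:i + 2], "big") for i in range(0, len(header), 2))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def decrement_ttl(ttl: int, checksum: int) -> tuple:
    """Decrement TTL by 1 and patch the checksum incrementally (RFC 1624):
    HC' = ~(~HC + ~m + m'), where m is the old 16-bit word holding the TTL."""
    old_word = ttl << 8                     # TTL occupies the high byte of its word
    new_ttl = ttl - 1
    new_word = new_ttl << 8
    s = (~checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return new_ttl, ~s & 0xFFFF
```

The patched checksum matches a full recomputation over the modified header, which is what makes the single-word update safe in the fast path.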
Ip_trie5_lookup supports a dual lookup: a direct-entry table and trie block lookups in SRAM. It searches for the best matching prefix of the IP destination address and returns a route pointer, namely an index to a route entry in SDRAM, which is stored in the rt_ptr register. The thread then reads the forwarding information, such as the destination MAC address and output port number, from the SDRAM route entry into the $$dxfer register based on rt_ptr. After that, the IP header is modified according to the forwarding information, and the modified header is written to the buffer and prepended to the payload already in SDRAM.
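The SRAM layout of ip_trie5_lookup is IXP1200-specific, but the underlying longest-prefix-match idea can be sketched with a plain binary trie: walk the destination address bit by bit and remember the last route entry seen, which plays the role of rt_ptr. A simplified sketch (the data structure and names are illustrative, not the actual table format):

```python
class TrieNode:
    __slots__ = ("children", "route")
    def __init__(self):
        self.children = [None, None]   # 0-branch and 1-branch
        self.route = None              # route pointer if a prefix ends here

def insert(root, prefix: int, length: int, route):
    """Install a route for the top `length` bits of the 32-bit `prefix`."""
    node = root
    for i in range(length):
        bit = (prefix >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.route = route

def lookup(root, addr: int):
    """Walk the trie on destination-address bits, remembering the last route
    seen, i.e. the best (longest) matching prefix."""
    node, best = root, None
    for i in range(32):
        if node.route is not None:
            best = node.route
        node = node.children[(addr >> (31 - i)) & 1]
        if node is None:
            return best
    return node.route if node.route is not None else best
```

The direct-entry table in the real implementation simply replaces the first few levels of this walk with one indexed load; the trie blocks cover the remaining bits.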
If rec_state does not contain SOP, in other words if continuation packet data is arriving in the RFIFO, port_rx_bytecount_extract extracts the last packet's byte count from rec_state and converts it to a quadword count. The thread updates the target buffer address and then transfers 64 bytes of data from the RFIFO to the buffer.
If the packet is an EOP and is not discarded, the descriptor is written to SRAM. The packet is then enqueued based on the descriptor SRAM address (desc_addr), the output interface indicating the transmit queue (output_intf), the type of queue, such as linked list or circular (Q_TYPE), the location in Scratchpad of the packet-present indication called Ports with Packets (PWP) (Q_RDY), the base address of the queue (Q_BASE), and the base address of the descriptor buffers (DESC_BASE). In enqueue_packet, the transmit queue is locked just before enqueuing so that other threads cannot access it. The thread then sets the PWP bit (one bit per port) in Scratchpad so that the transmit thread can see that a packet is ready to send toward the output port, and finally unlocks the transmit queue. If rec_state contains the discard bit, the thread resets the packet buffer address to the beginning of the packet in order to discard the packet and reuse the buffer.
RECEIVE_THREAD_MAIN_LOOP:
    // Input Scheduler
    receive_ready_check()
    receive_request(rec_req)
    (rec_state, exception) = receive_status()
mpacket_received:
    if (pkbuf_addr == UNALLOCATED)
        (pkbuf_addr, desc_addr) = pkbuf_allocate(PKBUF_BASE, PKBUF_SIZE, DESC_BASE, DESC_SIZE)
    end if
    port_rx_fail_error_check(exception)
    // Classifier
    if (bit(rec_state, REC_STATE_SOP_BIT))                     // if SOP
        (proto_len, $pkt_buf) = get_mpkt_header()
        (pkstate, ethertype) = parse_packet(proto_len)
        // Filter
        pkaction = etherfilter(ethertype, $pkt_buf)
        if (pkaction == PKT_DENY)
            pk_late_discard(rec_state, rec_req, exception)
        else
            // Forwarder
            $pkt_buf_ip = get_IP_header()
            ip_version = IP_version_check(pkstate, $pkt_buf_ip)
            if (ip_version == IPV4_NO_OPTIONS)
                (exception, ip_dest) = xferpayload_&_iphdrchck_&_modify(pkbuf_addr, rfifo_addr, pkstate)
                if (exception)
                    pk_late_discard(rec_state, rec_req, exception)
                // Lookup
                else
                    rt_ptr = ip_trie5_lookup(ip_dest, SRAM_ROUTE_LOOKUP_BASE)
                    copy $$dxfer <- DRAM(addr(router_base + rt_ptr), size(3quadwords))
                    write_modified_IP_Ether_header($$dxfer)
                end if
            else                                               // IP with options or fragment
                copy RFIFO(addr(rfifo_addr + QWOFFSET0), size(8quadwords))
                    -> DRAM(addr(pkbuf_addr + QWOFFSET0))
                output_intf = CORE_STACK_INTF1 << 3
            end if
        end if
    else                                                       // not SOP
        current_bytecount = port_rx_bytecount_extract(rec_state)
        current_qwcount = current_bytecount >> 3
        pkbuf_addr = pkbuf_addr + 8
        copy RFIFO(addr(rfifo_addr + QWOFFSET0), size(current_qwcount)) -> DRAM(addr(pkbuf_addr + QWOFFSET0))
    end if                                                     // not SOP
    xbuf_free($pkt_buf)
    if (bit(rec_state, REC_STATE_EOP_BIT))                     // if EOP
        if (!bit(rec_state, REC_STATE_DISCARD_BIT))
            $desc_buf = update_descriptor()
            copy $desc_buf -> SRAM(addr(desc_addr + LWOFFSET0), size(2lwords))
            // Enqueue
            enqueue_packet(desc_addr, output_intf, Q_TYPE, Q_RDY, Q_BASE, DESC_BASE)
            pkbuf_addr = UNALLOCATED
            xbuf_free($desc_buf)
        else
            pkbuf_addr = buf_dram_addr_from_sram_addr(desc_addr, PKBUF_BASE, PKBUF_SIZE, DESC_BASE, DESC_SIZE)
        end if
        rec_state = 0
    end if
    xbuf_free($$dxfer)
    goto RECEIVE_THREAD_MAIN_LOOP
Figure 3-2. Pseudo Code of Receive Thread Main Loop
3.2.2 Transmit Packet Processing
As described, transmit packet processing consists of a transmit scheduler and three transmit threads that move packets from the transmit buffer to the transmit FIFO; each is allocated to one of the four threads in a Microengine. Figure 3-3 presents the pseudo code of the transmit scheduler main loop along with segmented pseudo code. First of all, the scheduler reads Ports with Packets (PWP) from the Scratchpad address specified by pwp_addr into the $pwp SRAM transfer register for polling. The PWP information is aggregated with the value of the @local_pwp register, which locally tracks the current status of PWP and is modified by all transmit threads. The scheduler then creates three transmit assignments with tx_assign. @assign# denotes an absolute GPR used as a mailbox to hold the transmit request for a transmit thread; the number "#" corresponds to the number of the transmit thread (1 to 3) in a Microengine. The scheduler checks each port sequentially for enqueued data. If any outbound packets are queued for transmission on a port, a new assignment is set into @assign# and the target port number is incremented for the next check. Before the transmit assignment is updated, however, the scheduler must wait on a semaphore until the transmit thread becomes idle; the valid bit (bit 31) of @assign# serves as this semaphore, shared with the transmit threads. If the next port has no queued packets, the scheduler just sets the skip bit and updates the target port number.
// Scheduler
TRANSMIT_SCHEDULER_MAIN_LOOP:
    copy $pwp <- Scratch(addr(pwp_addr), size(1bit))
    aggregate_pwp = $pwp | @local_pwp
    (target_port, @assign1) = tx_assign(target_port, @assign1, aggregate_pwp, SKIP_BIT, PORT_INCR)
    (target_port, @assign2) = tx_assign(target_port, @assign2, aggregate_pwp, SKIP_BIT, PORT_INCR)
    (target_port, @assign3) = tx_assign(target_port, @assign3, aggregate_pwp, SKIP_BIT, PORT_INCR)
    goto TRANSMIT_SCHEDULER_MAIN_LOOP
//******************************** Segmented Pseudo Code ***********************************
//************** Format of Transmit Assignment ************
// RES: Reserved
//*********************************************************
tx_assign(target_port, @assign#, aggregate_pwp, SKIP_BIT, PORT_INCR)
{
    new_target_port = target_port + PORT_INCR
    if (((aggregate_pwp >> target_port) & 1) > 0)   // test PWP bit for target_port (low 5 bits of shift)
        new_assignment = target_port
        goto set_assignment
    else
        new_assignment = new_assignment | (1 << SKIP_BIT)
    end if
set_assignment:
    sem_wait(@assign#)                              // if semaphore set, exit
    @assign# = new_assignment
    target_port = new_target_port
}
sem_wait(@assign#)
{
begin:
    if (@assign# < 0)                               // watch bit 31 (semaphore); if set, then exit
        goto end
    else
        goto begin
end:
}
//******************************************************************************************
Figure 3-3. Pseudo Code of Transmit Scheduler Main Loop
Format of Transmit Assignment (RES = Reserved):
  31      30:9    8      7:4     3:0
  Valid   RES     Skip   Port    RES
Figure 3-4 shows the pseudo code of the transmit thread main loop. When the last packet has been transferred into the TFIFO, the transmit assignment (@assign#) is inverted to set the semaphore, allowing the scheduler to place a new transmit assignment in it. Tx_assignment_read waits until the scheduler sets the next transmit assignment for the thread and simultaneously flips the semaphore bit. Once the new assignment is set, the transmit thread reads it and extracts the port number (port), skip flag (skip_flag), TFIFO element (tfifo_entry), and transmit queue offset (q_offset). In the IXP1200, there are 16 TFIFO elements specific to the output ports, each containing 64 bytes for outbound packet data and 16 bytes for the control and prepend fields. In this code, the TFIFO element number is the same as the port number. Since the transmit threads run on two Microengines, one Microengine takes the even TFIFO elements (8 in total) and the other takes the odd TFIFO elements (8 in total).
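Unpacking the assignment word in tx_assignment_read amounts to shift-and-mask operations on the transmit assignment format shown after Figure 3-3 (valid in bit 31, skip in bit 8, port in bits 7:4). A sketch (the helper names are mine, and the q_offset derivation is omitted):

```python
VALID_BIT, SKIP_BIT = 31, 8

def tx_assignment_fields(assign: int):
    """Unpack a transmit assignment word: valid (bit 31), skip (bit 8),
    port (bits 7:4). The TFIFO element equals the port number in this design."""
    valid = (assign >> VALID_BIT) & 1
    skip_flag = (assign >> SKIP_BIT) & 1
    port = (assign >> 4) & 0xF
    tfifo_entry = port            # element number == port number
    return valid, skip_flag, port, tfifo_entry

def make_assignment(port: int, skip: bool = False) -> int:
    """Build an assignment word with the valid (semaphore) bit set."""
    word = (1 << VALID_BIT) | ((port & 0xF) << 4)
    if skip:
        word |= 1 << SKIP_BIT
    return word
```

Because the valid bit doubles as the semaphore, testing bit 31 and extracting the fields both cost only a shift and a mask, which is why the mailbox scheme is cheap enough for the fast path.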
If skip_flag is not set, the thread restores port information such as ele_remaining, buf_offset, bank, and last_mpkt_byte_cnt from the global port-in-progress registers (@port_inprog0-7) associated with each port. Ele_remaining indicates the number of remaining elements in an outbound MAC packet. Buf_offset denotes the offset from the top of the transmit packet buffer to the start of the valid data. Bank indicates the SDRAM bank that the packet is in. Last_mpkt_byte_cnt denotes the byte enable for the last packet. If the last packet has been completely transferred to the TFIFO, namely ele_remaining equals 0, the thread locks the SRAM transmit queue so that other threads cannot access it, reads the two-longword queue descriptor for the next packet, and gets the head and tail pointers of the linked list that organizes the buffers. It then reads the two-longword packet link list and extracts the next ele_remaining, last_mpkt_byte_cnt, bank, and buf_offset for the packet. The thread then updates the queue descriptor. First, it decrements the packet count, which holds the number of packets in the transmit queue, by 1 and places it in the SRAM transfer register ($q_desc1). If the packet count is greater than 0, the thread changes the existing tail pointer to the new head pointer and merges it with the new tail pointer; the merged head and tail pointers are placed in the SRAM transfer register ($q_desc0). Finally, these descriptor values ($q_desc0 and $q_desc1) are written back to the transmit queue descriptor, and the queue is unlocked.
If ele_remaining equals 1, only one 64-byte packet segment is queued, and tx_last_mpkt moves it directly from SDRAM to the TFIFO. Tx_status_set prepares the control information, containing parameters such as the port number and the EOP and SOP flags, in the $tfifo_ctl_wd0 register, and tfifo_validate then writes it to the control field of the TFIFO. In addition, tfifo_validate reads the transmit pointer and transmit ready flags from the FBI until the transmit pointer is equal to or one less than the current TFIFO element. The transmit pointer is continuously maintained by the Transmit State Machine (TSM) in the FBI and points to the TFIFO element that the TSM expects to send next. The transmit ready flags indicate which ports will accept data. If the transmit port is ready, the transmit thread sets "Pass" in the return_status register and sets the valid flag in the TFIFO. When the SDRAM transfer completes, the SDRAM controller also sets a valid bit in the TFIFO control field. When both valid bits are set, the TSM commences transfer of the data from the TFIFO to the MAC device. If the result is a failure, the thread sets the skip bit in the TFIFO control field.
If return_status shows "Pass", the thread releases the packet descriptor and buffer. Otherwise, the thread saves the remaining elements, last packet byte count, and buffer offset in @port_inprog0-7. Moreover, tx_portvect_modify sets @local_pwp based on the port number.
Whether the segment is in the SOP position, the EOP position, or between them, the thread performs the same sequence of packet transfer, control field setup, and data validation and checking. If the restored ele_remaining is 0 and the value read from the link list is greater than 1, the segment is SOP but not EOP; if the restored value is 0 and the new value is 1, the segment is both SOP and EOP. If the restored ele_remaining is 1, the segment is EOP but not SOP; if it is greater than 1, the segment lies between SOP and EOP. Only when the EOP has been transferred to the TFIFO does @local_pwp have to be cleared.
If the transmit scheduler sets the skip bit in the transmit assignment, the transmit thread is responsible for ensuring that the TSM skips over the assigned TFIFO element. To do this, the transmit thread issues a null SDRAM transfer to force the SDRAM controller to set the TFIFO valid bit. The transmit thread then sets the skip bit in the TFIFO control field and sets its valid bit. This procedure makes the TSM skip the data in this TFIFO element and go on to the next one.
TRANSMIT_THREAD_MAIN_LOOP:
    @assign# = ~(@assign#)
    (q_offset, port, skip_flag, tfifo_entry) = tx_assignment_read()
process_assignment:
    if (skip_flag != SKIP_BIT_SET)
        (ele_remaining, buf_offset, bank, last_mpkt_byte_cnt) = tx_portinfo_restore(port, @port_inprog0-7)
        if (ele_remaining == 0)                     // if no elements left in last packet
            ($q_desc0, $q_desc1, $pkt_link0, $pkt_link1, tail_ptr, ele_remaining, bank, buf_offset,
                last_mpkt_byte_cnt) = tx_pktlinklist_read(q_desc_base, q_offset, buf_desc_base)
            tx_pktlinklist_update($q_desc0, $q_desc1, q_desc_base, tail_ptr, q_offset, $pkt_link0, port)
            if (ele_remaining == 1)                 // if SOP and EOP
                tx_last_mpkt_xfr(bank, buf_offset, last_mpkt_byte_cnt, tfifo_entry, pkt_buffer_base)
                $tfifo_ctl_wd0 = tx_status_set(last_mpkt_byte_cnt, EOP_AND_SOP, port)
                return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)
                if (return_status == PASS)          // if good validate
                    tx_ll_buf_free($q_desc0, bank, buf_desc_base, DESC_SIZE, 16)
                else                                // could not validate
                    (@port_inprog0-7) = tx_portinfo_sop_save()
                    tx_portvect_modify(@local_pwp, port, 1)
                end if
            else                                    // SOP, but not EOP
                tx_portvect_modify(@local_pwp, port, 1)
                tx_mpkt_xfr(bank, buf_offset, tfifo_entry, pkt_buffer_base, 8)   // 64-byte transfer
                $tfifo_ctl_wd0 = tx_status_set(const_0, 0xfd, port)
                return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)
                if (return_status == PASS)
                    (@port_inprog0-7) = tx_portinfo_save()
                else
                    (@port_inprog0-7) = tx_portinfo_save_no_decr()
                end if
            end if
        else                                        // not SOP
            if (ele_remaining == 1)                 // if not SOP, but EOP
                tx_last_mpkt_xfr(bank, buf_offset, last_mpkt_byte_cnt, tfifo_entry, pkt_buffer_base)
                $tfifo_ctl_wd0 = tx_status_set(last_mpkt_byte_cnt, 0x2, port)
                return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)
                if (return_status == PASS)
                    tx_ll_buf_free(buf_offset, bit20on, bank, buf_desc_base, DESC_SIZE, 3)
                    tx_portinfo_update()
                    tx_portvect_modify(@local_pwp, port, 0)   // clear local pwp bit for this port
                end if
            else                                    // not SOP and not EOP
                tx_mpkt_xfr(bank, buf_offset, tfifo_entry, pkt_buffer_base, 8)
                $tfifo_ctl_wd0 = tx_status_set(const_0, 0xfc, port)   // no EOP, no SOP
                return_status = tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)
                if (return_status == PASS)
                    tx_portinfo_update()
                end if
            end if
        end if
transmit_done:
    else                                            // given a "skip tfifo element" assignment
        tfifo_element_skip_nordy(tfifo_entry, pkt_buffer_base)
    end if
    goto TRANSMIT_THREAD_MAIN_LOOP
Figure 3-4. Pseudo Code of Transmit Thread Main Loop
4. Network Processor Architecture
A network processor typically covers the fundamental functions of routers and accelerates them in its hardware architecture. It is programmable, enabling easier migration to new protocols and technologies without requiring new ASICs. A network processor therefore generally incorporates multiple general-purpose processors and special hardware-assist engines, for example for hashing, tree-structure lookup, checksums, filtering, and classification for security. Figure 4-1 presents an overview of the architecture of the Intel IXP1200 network processor.
The IXP1200 is composed of a StrongARM microprocessor, six independent 32-bit RISC engines (Microengines) with hardware multithreading support, standard memory interfaces, and high-speed bus interfaces to Media Access Control (MAC) layer devices and PCI. It can replace the host processor and all of the ASICs in an ASIC-based router system. The programmable Microengines make it easy to add new functionality by software update instead of hardware modification. The high-speed bus interface for packet transfer, called the Internet Exchange (IX) bus interface, is provided by the Fast Bus Interface (FBI) unit, which also includes scratchpad RAM and a hash unit to support packet processing. In addition, the IXP1200 connects to SDRAM, which stores packets arriving from the MAC devices, and SRAM, which stores heavily used data structures such as the FIB lookup tables. Requests for access to SDRAM or SRAM are queued in the SDRAM and SRAM units, respectively, by executing a specific reference instruction. Each Microengine can directly access the SDRAM unit, SRAM unit, and FBI unit via two separate 32-bit internal buses. The PCI bus can also be used to connect external MAC devices instead of the IX bus. The following sections describe the Microengine and the FBI unit in more detail.
[Block diagram: Intel StrongARM SA-1 core (16-Kbyte I-cache, 8-Kbyte D-cache, 512-byte mini-D-cache, write buffer, read buffer, JTAG), PCI unit with UART, four timers, GPIO, and RTC; SRAM unit and SDRAM unit; FBI unit with 4-Kbyte scratchpad memory, hash unit, and IX bus interface; and six Microengines, connected by 32-bit and 64-bit internal buses (32-bit ARM system bus and 32-bit data buses).]
Figure 4-1. Architecture of the Intel IXP1200
4.1 Microengine Architecture
The six Microengines can execute a total of 24 threads associated with the fast-path processing of routers without support from the StrongARM core. Each Microengine has four independent program counters, hardware support for very low-overhead context switching (minimum overhead: 1 cycle), a programmable 4-Kbyte instruction store, 128 32-bit general-purpose registers, 128 32-bit transfer registers, and an ALU and shifter capable of performing an ALU and shift operation in a single cycle. The instruction set was specifically designed for networking applications. Figure 4-2 depicts an overview of a Microengine.
Figure 4-2. Microengine Architecture
[Diagram: context arbiter and event processor; four program counters; instruction decode and microprogram control store; command reference FIFO; 32-register SRAM and SDRAM read and write transfer banks; 64 A-side and 64 B-side general-purpose registers; and an ALU with shifter fed by A-side and B-side muxes.]
The 128 general-purpose registers can be addressed using relative or absolute addressing. Relative addressing divides the registers among the four threads, while absolute addressing allows a register to be shared. The registers are single-ported and are divided into two banks so as not to impair performance.
The transfer registers store data that has been read in from memory or a device on the IX bus, as well as data that is to be written out to memory or a device on the IX bus. The transfer registers are likewise divided into multiple banks: each Microengine has 32 SRAM read, 32 SRAM write, 32 SDRAM read, and 32 SDRAM write registers. A Microengine can issue a data transfer between multiple transfer registers and the SRAM or SDRAM with a single command.
The ALU and shifter perform standard arithmetic and logic functions, plus some unique operations that are useful in packet processing. Because of this, the Microengine can accomplish in a single instruction sophisticated packet processing that would take several instructions on a general RISC processor. Each Microengine also contains a set of control and status registers, which the StrongARM core uses to program, control, and debug the Microengines. The instructions used in the Microengine can be classified into five categories: 1) arithmetic, rotate, and shift instructions; 2) branch and jump instructions; 3) reference instructions; 4) local register instructions; and 5) miscellaneous instructions. Appendix B describes the instruction set of a Microengine in detail.
The Microengine maintains four program counters, only one of which may be active at any given time. This enables the Microengine to keep track of four separate threads, or executing processes. Threads can use and share the same code in the program store, have separate code, or some of each. The register groups are all divided into four separate sets so that each thread can easily maintain its own context. A running thread must voluntarily suspend itself for another thread to start; this is called cooperative multitasking. A thread normally swaps itself out while it is waiting for something external to occur, for example for a read from SDRAM to complete. To accomplish this, the programmer simply indicates what condition the thread is waiting for and tells the processor to swap. The Microengine controller then moves on to the next thread that is ready to run; that thread will eventually swap out as well, and if no other thread is ready to run, the first thread resumes. In this way, the Microengine can hide the long latencies caused by referencing off-chip memory.
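How well this hides latency can be estimated with a simple round-robin model: while one thread waits on a memory reference, the other three execute their compute phases, and the latency is fully hidden once their combined compute covers it. The model below is an idealized illustration, not an IXP1200 measurement:

```python
def hidden_fraction(compute_cycles: int, mem_latency: int, threads: int = 4) -> float:
    """Fraction of one thread's memory latency covered by the other threads'
    compute phases, in an idealized round-robin model with zero-cost switches."""
    cover = (threads - 1) * compute_cycles
    return min(1.0, cover / mem_latency)
```

For example, with 20 cycles of compute per thread, the other three threads cover 60 cycles of memory latency, so a reference of that length stalls the pipeline not at all; halve the compute and only half the latency is hidden.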
4.2 FBI Unit Architecture and IX Bus Interface
The FBI unit contains the receive and transmit FIFO buffers (RFIFO and TFIFO), 4 Kbytes of scratchpad RAM, push and pull engines with an 8-entry pull command queue and an 8-entry push command queue, and a 48- or 64-bit hardware hash unit with an 8-entry hash command queue. In fact, RFIFO and TFIFO are something of a misnomer, because they are memories that can be accessed in any order; they act as circular buffers traversed by a pointer. Each FIFO has 16 elements, and each element comprises 10 quadwords (i.e., 10 x 64 bits). An RFIFO element is composed of 8 quadwords (64 bytes) for data, 1 quadword for status, and 1 quadword for an extended data field. A TFIFO element is made up of 8 quadwords (64 bytes) for data, 1 quadword for control, and 1 quadword for a prepend field. The FBI also controls the 64-bit IX bus interface, which consists of transmit and receive state machines that operate independently and in parallel, a ready bus sequencer, and the IX bus arbiter. In addition, the FBI unit contains control and status registers (CSRs) accessible by both the Microengines and the StrongARM core. Figure 4-3 presents the FBI unit architecture.
Figure 4-3. FBI Unit Architecture
[Diagram: 8-command pull, push, and hash queues fed by the AMBA (core) and Microengine command buses; pull and push engines with arbiters; TFIFO and RFIFO (16 elements of 10 quadwords each); CSRs; 1K x 32 scratchpad; hash unit; and the IX bus interface with ready bus sequencer, transmit and receive state machines, and IX bus arbiter on the 64-bit IX bus and ready bus.]
Figure 4-4 illustrates the relationship between the Ready bus and a MAC device. A typical MAC device (such as the Intel 21440 Octal 10/100 Mbps Ethernet controller) provides transmit and receive ready flags that indicate whether the amount of data in a FIFO has reached a certain threshold. The Ready Bus Sequencer in the IXP1200 periodically polls the receive and transmit FIFO ready
flags and places them into the FBI registers RCV_RDY and XMIT_RDY. Software then reads those flags to determine whether the corresponding port is ready for receiving or transmitting. In this way, the FBI unit manages multiple shared ports in a combined effort with the MAC devices.
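Since RCV_RDY and XMIT_RDY hold one ready bit per port, the check that software performs is a simple shift-and-test. An illustrative sketch (not the actual register-access code; the function names are mine):

```python
def port_ready(rdy_flags: int, port: int) -> bool:
    """Test the ready bit for `port` in a flag register such as RCV_RDY or XMIT_RDY."""
    return bool((rdy_flags >> port) & 1)

def ready_ports(rdy_flags: int, num_ports: int = 16):
    """List every port whose ready flag is set, as a scheduler poll would see it."""
    return [p for p in range(num_ports) if port_ready(rdy_flags, p)]
```

A scheduler thread polling this way pays one register read per poll regardless of how many of the 16 ports are ready, which is why the aggregated flag registers scale well.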
Figure 4-4. Ready Bus and Ready Flags
4.3 Microengine Pipelining
In the IXP1200, each Microengine executes the five-stage pipeline shown in Figure 4-5. The pipeline is composed of P0 = Instruction Fetch (F), P1 = Decode (D), P2 = Read operands (R), P3 = Execute (E), and P4 = Write (W); each stage takes one Microengine cycle. In the F-stage, four program counters (PCs) are multiplexed to operate the four threads in a Microengine. The context arbiter, on receiving signal events, determines the next instruction address, and the next instruction is then fetched from the 4-Kbyte microprogram store. The D-stage decodes the fetched instruction and passes the immediate part to the next stage if necessary. In the R-stage, two operands are read from the GPRs, SDRAM transfer registers (SDRAM xfers), SRAM transfer registers (SRAM xfers), Pipe Latch D/R, or Pipe Latch E/W. The E-stage performs the ALU or shift operation; the four threads share the ALU and shifter. In the W-stage, the result of the ALU/shift operation is written to the SDRAM xfers, SRAM xfers, or GPRs.
Figure 4-5. Microengine Pipeline
In this microarchitecture, there is basically no structural hazard, because a Microengine does not have the data fetch stage normally included in general-purpose RISC CPUs, so the F-stage does not share memory with other stages; a Microengine accesses the microprogram store to read an instruction only in the F-stage. Regarding data hazards, this pipeline inherently has no Write After Read (WAR) or Write After Write (WAW) hazards, since it is a five-stage pipeline in which the read and write positions are fixed. However, the architecture adopts data forwarding from E-stage to E-stage and from E-stage to R-stage to avoid Read After Write (RAW) hazards, as shown in Figure 4-5. Moreover, there are several control hazards caused by branch instructions and context switches (not shown in Figure 4-5); their influence depends on the type of instruction. Section 4.5 describes the branch and context switch decision mechanisms and shows how they generate aborted cycles.
4.4 Memory Access
One of the unique characteristics of the Microengine is the way it accesses memory such as SDRAM and SRAM. When a Microengine transfers data to or from memory, the data goes through either the SDRAM or SRAM transfer registers (xfers) rather than being accessed directly. Figure 4-6 depicts the memory access flow for SDRAM and SRAM. For an SDRAM write operation, a Microengine first stores the write data into the SDRAM transfer registers and then issues an SDRAM write command to the SDRAM unit. The memory controller, namely the SDRAM unit, then executes a DMA-like transfer from the appropriate SDRAM transfer registers to the SDRAM. For an SDRAM read operation, a Microengine first issues the read request to the SDRAM unit; the SDRAM unit then pulls the data out of the SDRAM and deposits it in the specified SDRAM transfer registers. When the Microengine is notified that the data has been written into the transfer registers, it can read the data out of them. The SRAM unit handles transfers between the SRAM and the SRAM transfer registers similarly; in fact, the SRAM unit also covers transfers for other resources such as the RFIFO, TFIFO, CSRs, hash unit, and scratchpad memory. As a result, by combining these transfer registers with the other per-context resources (i.e., PCs and GPRs) and the context switch arbiter, a context switch can be realized with little overhead and can hide many stall cycles. Strictly speaking, the maximum overhead is one cycle; the reason is addressed in Section 4.5.
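The staging role of the transfer registers can be mocked in a few lines to make the protocol concrete: writes are staged into the registers before the command is issued, and reads land in the registers before the thread consumes them. This is a behavioral sketch only, not the hardware interface:

```python
class SdramUnitMock:
    """Behavioral mock of the SDRAM unit: reads deposit data into transfer
    registers, writes pull data from them, mirroring the flow of Figure 4-6."""
    def __init__(self, size: int):
        self.mem = [0] * size
        self.xfer = [0] * 8       # one thread's SDRAM transfer registers

    def read(self, addr: int, count: int):
        # The unit pulls data out of SDRAM into the xfer registers; the
        # thread is then signaled and reads self.xfer, never SDRAM directly.
        self.xfer[:count] = self.mem[addr:addr + count]

    def write(self, addr: int, count: int):
        # The thread stages data in the xfer registers first, then issues the
        # command; the unit performs the DMA-like transfer into SDRAM.
        self.mem[addr:addr + count] = self.xfer[:count]
```

Because the thread never touches SDRAM directly, it is free to swap out between issuing the command and consuming the transfer registers, which is exactly the window the context arbiter exploits.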
Figure 4-6. Memory Access flow
4.5 Branch and Context Switch Mechanism
This section describes the branch decision mechanism and the branch penalties resulting from control hazards in the execution pipeline. Since a context switch can also cause control hazards and behaves analogously to a branch instruction, both are explained together here. In addition, three supporting techniques for avoiding branch penalties in the Microengine are introduced: deferred branches, setting condition codes earlier, and branch guessing.
First of all, the branch instructions can be categorized into the three classes shown in Table 4-1; only class 1 actually includes context switch instructions. The branch decision is made in P1 (D-stage), P2 (R-stage), or P3 (E-stage) depending on the class. The following subsections explain how each class performs branching.
Table 4-1. Instructions Categorized by Class
Class 3: br_bclr, br_bset, br=byte, br!=byte, jump, rtn, br_!signal, br_inp_state
Class 2: br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout
Class 1: br, br=ctx, br!=ctx, ctx_arb, csr, r_fifo_rd, t_fifo_wr, scratch, sdram, sram, hash1_48, hash2_48, hash3_48, hash1_64, hash2_64, hash3_64
Note: In class 1, ctx_arb, csr, r_fifo_rd, t_fifo_wr, scratch, sdram, sram, and the hash instructions are context switch instructions (shown in blue in the original table).
4.5.1 Class 3 Instructions
Class 3 instructions always make the branch decision in the P3 E-stage. Figure 4-7 shows an example of the pipeline operation of a class 3 instruction. Since the instruction branch on bit clear (br_bclr) branches on the basis of whether a specified bit of a register is clear or set, an ALU operation must execute and set the condition code before br_bclr executes. (In fact, class 3 includes both instructions that require the condition code to be set and instructions that do not.) When necessary, the condition code is set in the E-stage of another instruction and passed to the E-stage of the branch instruction. The condition code is latched, and the branch instruction then determines whether the branch is taken. If not taken, the pipeline proceeds as a normal pipeline stream. If taken, the three instructions after the branch instruction are squashed and aborted, and the pipeline restarts from the target instruction. In other words, the Microengine has a control hazard, and class 3 instructions normally incur three branch penalty cycles.
Figure 4-7. Branch pipeline example with class3 instruction
4.5.2 Class 2 Instructions
Class 2 instructions make the branch decision in either the D-stage or the R-stage, depending on when the condition code is set; Figures 4-8 and 4-9 show the two possible cases. The condition code is again generated by an ALU instruction in the E-stage, but for class 2 instructions the condition can be passed directly into the branch decision stage (R-stage or D-stage) without a pipeline latch. If the branch instruction executes immediately after the instruction that sets the condition code, the branch decision is made in the R-stage (Figure 4-8). If not taken, the pipeline performs as normal; if taken, the two instructions after the branch are aborted and the target instruction is fetched, giving two penalty cycles in this case. If an instruction is inserted between the condition-code-setting instruction and the branch instruction, the earlier branch decision causes only one instruction to be aborted when the branch is taken (Figure 4-9); if not taken, the pipeline simply proceeds.
Figure 4-8. Branch pipeline example with class2 instruction (case1)
Figure 4-9. Branch pipeline example with class2 instruction (case2)
4.5.3 Class1 Instructions
Class1 instructions fall into two groups: branch instructions and context switch instructions. A context switch instruction changes the execution context and branches to the next instruction to be executed in the other context. For class1 instructions, the branch decision is made in the D-stage, immediately after the initial decoding of the instruction (Figure 4-10). The context switch decision is likewise made in the D-stage, and the result is sent to the context arbiter. Once the instruction is decoded, all the information needed to make the branch decision is available. If the branch is not taken or the context is not switched, the pipeline proceeds straight through without squashing any instruction. If taken or switched, there is one penalty cycle, because the branch decision cannot be made before the D-stage.
Figure 4-10. Branch pipeline example with class1 instruction
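The taken-branch penalties of the three classes can be summarized in a small sketch (the function name is mine; the per-class counts come from the descriptions in Sections 4.5.1 to 4.5.3):

```python
# Taken-branch penalty cycles per IXP1200 instruction class, as described above.
# Class2's penalty depends on how many instructions separate the condition-code
# setter from the branch (0 -> R-stage decision, two penalties; 1+ -> one).
def branch_penalty(instr_class: int, gap_after_cc_set: int = 0) -> int:
    if instr_class == 3:
        return 3                      # decision in the E-stage: three squashed
    if instr_class == 2:
        return 2 if gap_after_cc_set == 0 else 1
    if instr_class == 1:
        return 1                      # decision in the D-stage: one squashed
    raise ValueError("unknown instruction class")
```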
4.5.4 Solutions for Branch Penalties
There are several ways to reduce branch penalties. First, the branch instructions of the Microengines support deferred branches through the "defer" optional token within an instruction. Software programmers can set the option manually for each branch instruction, and the IXP1200 Assembler also supports an optimization that performs deferred branch scheduling automatically. The deferred branch option can reduce or eliminate aborted instructions in the execution pipeline: in a deferred branch, one or more instructions following the branch decision are allowed to execute before the branch takes effect. Figure 4-11 presents the pipeline in the case that a deferred branch is taken. Since the instruction is class3, the defer token can fill up to three instruction slots before the pipeline branches, hiding the branch latency. The number of instructions that can be deferred depends on the instruction class. This option can be applied to context switch instructions as well. As a result of using deferred branches, computation efficiency can be improved considerably.
Figure 4-11. Branch pipeline example with deferred branch instruction
Second, for class2 instructions, setting the condition code early eliminates one aborted instruction, because the branch decision can be made one cycle earlier. This is easily seen by comparing Figure 4-8 with Figure 4-9.
Third, the IXP1200 supports guess branches, which prefetch an instruction from the branch-taken path before the actual branch decision is made. This option is enabled by the guess_branch optional token within an instruction, in the same way as the deferred branch. Table 4-2 shows which instructions support the guess_branch token. As Figure 4-12 shows, if the guess branch is taken, one aborted cycle is incurred; if it is not taken, two aborted cycles result from the branch misprediction. A guess branch can also be combined with a deferred branch to hide the branch penalty of the taken path, as depicted in Figure 4-13.
Table 4-2. Guess Branch Instructions

Supports guess_branch                              Does not support guess_branch
br_bset       br=cout    br<0    br                br!=byte
br_bclr       br>0       br=0    br=ctx            jump
br_inp_state  br!=cout   br<=0   br!=ctx           rtn
br_!signal    br>=0      br!=0   br=byte
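The interaction of the defer and guess_branch options with a taken or not-taken branch can be sketched as a simplified model (my own illustration; it assumes each filled defer slot hides one penalty cycle, as Figures 4-11 to 4-13 suggest):

```python
# Simplified cost model for an IXP1200 branch with the defer and
# guess_branch options. Without guess, a not-taken branch costs nothing
# and a taken branch costs the class's base penalty; with guess, the
# taken path is prefetched, so taken costs one cycle and a misprediction
# (not taken) costs two. Each filled defer slot hides one penalty cycle.
def taken_branch_cost(base_penalty: int, deferred: int = 0,
                      guess: bool = False, taken: bool = True) -> int:
    if guess:
        cost = 1 if taken else 2
    else:
        cost = base_penalty if taken else 0
    return max(0, cost - deferred)
```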
Figure 4-12. Branch pipeline example with guess instruction
Figure 4-13. Branch pipeline example with guess and deferred branch options
5. IXP1200 Network Processor Evaluation
5.1 Methodology
This paper studies the computer architectural properties of an emerging, well-known network processor, Intel's IXP1200, focusing on packet processing on the forwarding path. To study the architectural characteristics, I examine Microengine metrics such as instruction mix, memory access latencies, the executing/aborted/stalled/idle ratio, cycles per instruction (CPI), and throughput, on the basis of four workloads: fixed-size packet workloads of 64 bytes, 594 bytes, and 1518 bytes, and a mixture packet workload. I chose to run the evaluation on the simulator shipped with the IXP1200 development environment because the actual hardware does not provide fine-grained performance information. The simulator also guarantees that packets are always available at each input port.
The simulator environment consists of 1) the Workbench GUI, a graphical interface to all the Microengine tools; 2) the Microcode assembler; 3) the Microcode linker; 4) the Transactor, a debug and simulation engine that includes the IXP1200 architectural model and memory; and 5) the Simulation Extension, an API that enables a C, C++, or Verilog model of an IX bus device (i.e., a MAC device) to communicate with the Transactor and simulate their interaction. For the evaluation, I employed a reference program written in the Microcode Assembler language that implements forwarding on the Microengines. The reference program, named L2L3fwd16, is provided with the simulator environment by Intel; refer to the pseudo code of the receive and transmit thread main loops in Section 3.3. The simulation program assumes a router with 16 x 100Mbps Ethernet ports and includes router processing such as the input scheduler, filter, forwarder, IP lookup, and output scheduler. In the program, receive threads are assigned to Microengines 0-3, so sixteen threads run independently, one per port. Transmit threads are assigned to Microengines 4-5; since one thread per Microengine is assigned to the output scheduler, the other three threads per Microengine run transmit tasks independently. In the simulation, the Microengines operated at 232MHz, and the IX bus transferred packets at 104MHz. Two 8-port 100Mbps MAC devices (Intel IXF440) were connected to the IXP1200, and the bus frequency of SRAM and SDRAM was 116MHz.
Regarding the workloads, I set each workload up and sent packets to the input ports randomly. During each experimental run, the simulator had to forward 3000 packets. I chose this number because of the long running time of the simulator (about 1 hour per run under Windows 2000 on a 1.6GHz Pentium IV). Even though this number of packets sounds small, I am confident that results for more than 3000 packets would look similar to the results presented in this paper. The evaluation results are gathered both from statistics reported directly by the simulator and from scripts written to process the simulator's output.
5.2 Instruction Mix
Six Microengines are simulated on the basis of the four workloads: 64-byte, 594-byte, and 1518-byte fixed-size Ethernet packets and mixed Ethernet packets, each carrying IP information. Figures 5-1 to 5-3 show the distribution of the Microengines' instructions over five categories, called the instruction mix: 1) arithmetic, rotate, and shift instructions, including move operations and condition code setting for branching; 2) branch and jump instructions; 3) reference instructions, which transfer data between memories such as SRAM, SDRAM, Scratchpad RAM, RFIFO, TFIFO, and even the CSRs (Control Status Registers), and the SRAM/SDRAM transfer registers in the Microengines; 4) local register instructions, including loads of immediate data with or without a shift operation; and 5) miscellaneous instructions, which include nop, hashing, and context swap operations. In the simulation, Microengines 0, 1, 2, and 3 are dedicated to receive processing, and Microengines 4 and 5 execute transmit processing. The raw instruction mix data are shown as spreadsheets in Appendix C.
Figure 5-1 shows that "arithmetic, rotate, and shift instructions" and "branch and jump instructions" account for a high proportion of the receive instruction mix. They are mainly used for packet parsing, header checking, and header modification. As the chance of header checking and modification increases (i.e., with smaller packet sizes), the alu and alu_shf instructions are used more heavily. The ld_field and ld_field_w_clr operations (load bytes into a specific field), used for header modification, also strongly affect the proportion of "local register instructions". In addition, the ratio of "local register instructions" is associated with the semaphore operations that receive threads use to control utilization of the buses and internal queues. Since 64-byte packets are smaller than the others, resources are blocked and released more frequently; in fact, immed instructions (load immediate word and sign-extend or zero-fill with shift) are regularly used to set and clear the semaphore control values in GPRs. The ratio of "reference instructions" also increases slightly as the packet size decreases, because the Ethernet header is read from the RFIFO into the transfer registers more frequently for repeated header checking. Finally, the instruction mix of the mixed-packet workload appears to be dominated by the maximum-size packets: its proportions resemble the instruction mix of the 1518-byte workload even though the average packet size is 406 bytes.
[Stacked-bar chart: instruction ratio of each category per packet type (64B, 594B, 1518B, Mixture); raw data in Appendix C.]
Figure 5-1. Instruction Mix for Receiving Packets
The transmit instruction mix is presented in Figure 5-2. The ratios are more stable than those of the receive instruction mix; in particular, the three workloads other than 64 bytes are very similar. Still, some observations can be made. The ratio of "reference instructions" in the 64-byte workload is higher than in the others because accesses to the Scratchpad increase: a transmit thread reads "Ports with Packets (PWP)" from the Scratchpad, which is set by a receive thread after enqueuing a packet in SRAM, and the frequency with which PWP is set depends on the packet size of the workload. In addition, since a thread handling small packets has to update the transmit queue descriptor frequently using ld_field_w_clr, the ratio of "local register instructions" changes slightly. The total distribution of instructions for router processing is shown in Figure 5-3.
Packet type   Arith/Rotate/Shift   Branch/Jump   Reference   Local Register   Miscellaneous
64B           48.2%                30.7%         10.6%       8.2%             2.4%
594B          51.3%                30.7%          8.2%       8.6%             1.2%
1518B         50.9%                31.1%          8.5%       8.6%             0.9%
Mixture       50.7%                31.0%          8.5%       8.7%             1.1%
Figure 5-2. Instruction Mix for Transmitting Packets
[Stacked-bar chart: overall instruction ratio of each category per packet type (64B, 594B, 1518B, Mixture); raw data in Appendix C.]
Figure 5-3. Instruction Mix for Overall Processing
5.3 Latency
In general, accessing external resources such as SDRAM and SRAM often causes a serious deterioration in processor performance because of the large latency. This section examines the IXP1200's latencies for SDRAM and SRAM. The cumulative latency distributions are presented from the simulation of the 64-byte packet workload. Note that the figures in this section have different horizontal scales.
Figure 5-4 depicts the cumulative distribution of the latencies for SDRAM accesses. The data covers only read operations from referenced resources to the Microengines. In the graph, Microengines 0 to 3, each processing receive threads, produce nearly identical curves, which means the memory bandwidth is shared almost equally among the four Microengines. The minimum latency for referencing SDRAM is 43 cycles, and 50% of the SDRAM accesses take at least 75 cycles to finish. Such a long latency would normally result in long processor stalls and degraded performance. In the worst case, an SDRAM access can take up to 220 cycles.
Figure 5-4. SDRAM Latency
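A cumulative curve like the ones in Figures 5-4 to 5-6 is simply the fraction of accesses completing within a given cycle count; the sketch below uses made-up latency samples (not data from the simulator) to show the computation:

```python
# Sketch: building a cumulative latency distribution from per-access samples.
def cumulative_percentage(samples, threshold):
    """Percentage of accesses that complete within `threshold` cycles."""
    return 100.0 * sum(1 for s in samples if s <= threshold) / len(samples)

# Hypothetical SDRAM read latencies (cycles), for illustration only.
latencies = [43, 50, 62, 75, 90, 110, 140, 180, 200, 220]
curve = {t: cumulative_percentage(latencies, t) for t in range(40, 241, 20)}
```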
Figures 5-5 and 5-6 show the cumulative distribution of the latencies for SRAM references. There are two ways to access SRAM: unlocked (Figure 5-5) and locked (Figure 5-6). The SRAM controller maintains an 8-entry Content-Addressable Memory (CAM), which is used to protect an area of SRAM from being accessed by two or more processes (the StrongARM core and Microengine threads) at the same time. The Microengines access the Read Lock CAM using the sram instruction. In the L2L3fwd16 program, all ports have a transmit
queue associated with them. Each transmit queue has a queue descriptor containing head and tail pointers and a count of the packets in the queue. Since the queues are shared by receive and transmit threads, a thread first acquires a read lock before modifying a queue descriptor. The read_lock command locks the address and returns the contents of memory; the location is unlocked with either the unlock or the write_unlock command in the Microcode Assembler. In the figures, Microengines 0 to 3 show data for receive threads, and Microengines 4 and 5 show data for transmit threads and schedulers.
The two graphs look very similar. The curves for Microengines 0-3 have similar shapes in both graphs, and Microengines 4 and 5 make almost the same curve as each other, though different from Microengines 0 to 3. The SRAM latencies are much smaller than the SDRAM latencies. For unlocked accesses, the minimum latency is 16 cycles for Microengines 0 to 3 and 18 cycles for Microengines 4 and 5, and 50% of the accesses complete within 24 cycles for Microengines 0 to 3 and 21 cycles for Microengines 4 and 5. For the locked case, the minimum latency is 20 cycles for all Microengines, and about 50% of the SRAM accesses complete within 26 cycles for Microengines 0 to 3 and 22 cycles for Microengines 4 and 5. Unlocked access is therefore somewhat faster than locked access. The maximum latencies for unlocked and locked SRAM accesses are 204 and 251 cycles, respectively. The IXP1200 has other memory accesses that can also incur large latencies; Appendix D presents the latency graphs for the receive FIFO buffer, the Scratchpad RAM, the FBI CSRs, and the Hash unit, together with all the collected data.
Figure 5-5. SRAM Latency (unlocked)
Figure 5-6. SRAM Latency (locked)
5.4 Execution, Aborted, Stalled and Idle Ratio
As shown in Section 5.3, memory accesses generate very long latencies. In a multiprocessor system sharing memory, such latency generally causes numerous CPU stall cycles, consumes execution time, and becomes a bottleneck for system performance. With its hardware multithreading support, however, the IXP1200 can hide these long latencies and largely works around this issue.
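This latency-hiding effect can be approximated with a standard back-of-the-envelope model (my own illustration, not the simulator's calculation): with n threads that each compute for c cycles between memory references of latency m cycles, the engine stays busy as long as the other threads' compute covers the waiting thread's latency.

```python
def utilization(threads: int, compute_cycles: int, mem_latency: int) -> float:
    """Fraction of cycles spent executing, assuming perfect round-robin
    overlap of one thread's memory wait with the other threads' compute."""
    busy = threads * compute_cycles
    period = compute_cycles + mem_latency   # one thread's full cycle
    return min(1.0, busy / period)
```

For example, four threads each computing 20 cycles between 80-cycle memory references keep the engine about 80% busy; with enough compute per thread, utilization saturates at 100%.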
In this section, I present the distribution of executing, aborted, stalled, and idle cycles for each Microengine. The distributions for the four workloads are shown in Figures 5-7 to 5-10. Taken as a whole, the Microengines spend approximately 60% to 75% of their cycles executing, with the remainder mostly aborted or idle cycles, thanks to the hardware multithreading. The stalled ratios are extremely low, almost zero. These graphs therefore demonstrate the Microengines' advantage in hiding latency. Since the same receive program is assigned to Microengines 0 to 3, their distributions are similar; the distributions of Microengines 4 and 5, which run the transmit threads, are alike for the same reason.
However, there still seems to be room for improvement in the Microengine. The ratio of aborted cycles should not be overlooked. Although aborted cycles normally arise from branch and jump instructions, context switching also causes aborted cycles (one cycle of overhead per context switch), as described in Section 4.5. The aborted ratio is not trivial: it ranges from 26.4% to 41.7% in Microengines 0 to 3 and from 23.3% to 25.3% in Microengines 4 and 5, even though its impact is much smaller than the stall overhead in a typical multiprocessor system. Moreover, as the packet size of the workload grows, the aborted ratio of Microengines 0 to 3 increases, because larger packets require more frequent memory references for packet transfer, and the context switch overhead then dominates the distribution. Hardware branch and context switch prediction or speculation techniques could improve the performance of the IXP1200 effectively; alternatively, further optimization by the IXP1200 Assembler may be needed.
A multithreading example is presented in Appendix E. The illustration shows the thread history window of the GUI Workbench and indicates how the four threads operate in each of Microengines 0 to 2.
[Bar chart data: Executing 69.1-73.7% and Aborted 25.1-27.3% per Microengine; Stalled and Idle account for the small remainder.]
Figure 5-7. Executing, Aborted, Stalled, and Idle ratio on 64bytes Workload
[Bar chart data: Executing 60.2-76.1% and Aborted 23.3-37.9% per Microengine; Stalled and Idle account for the small remainder.]
Figure 5-8. Executing, Aborted, Stalled, and Idle ratio on 594bytes Workload
[Bar chart data: Executing 57.6-74.4% and Aborted 25.0-41.7% per Microengine; Stalled and Idle account for the small remainder.]
Figure 5-9. Executing, Aborted, Stalled, and Idle ratio on 1518bytes Workload
[Bar chart data: Executing 58.2-75.0% and Aborted 24.4-41.5% per Microengine; Stalled and Idle account for the small remainder.]
Figure 5-10. Executing, Aborted, Stalled, and Idle ratio on Mixture Workload
5.5 CPI (Cycles per Instruction)
Figure 5-11 presents the CPI of the six Microengines for the four workloads. The CPI is bounded below by 1, since each Microengine has a single pipeline with no out-of-order execution or speculation. Another finding is that the CPI increases as the packet size increases: with larger packets, more time is spent transferring the packet between the various registers and memories in the IXP1200, whereas the overhead for header processing and route lookup stays constant. In addition, although the average packet size of the mixture workload is 406 bytes, less than the 594-byte workload, its CPI values in the receive threads are larger than those of the 594-byte workload. The likely reason is that the mixture workload suffers more aborted cycles, as described in Section 5.4, which increases the CPI.
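Under the simplifying assumption that one instruction completes in every executing cycle and none in aborted, stalled, or idle cycles, the CPI can be estimated from the executing ratios of Section 5.4 (a rough model of my own, not the simulator's calculation):

```python
def cpi_from_exec_ratio(executing_ratio: float) -> float:
    """CPI = total cycles / instructions ~= 1 / executing fraction,
    assuming one instruction retires per executing cycle."""
    return 1.0 / executing_ratio
```

An executing ratio of 60%, for instance, corresponds to a CPI of about 1.67, which is consistent with the rough magnitudes in Figure 5-11.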
Figure 5-11. CPI for Microengines
5.6 Throughput
This section reports the throughput results for the router with 16 x 100Mbps ports under the four workloads. First, the theoretical throughput of a fixed-size workload can be calculated from the following equations.
Packet arrival rate per port (pps)
  = 100Mbps / ((12 bytes (IFG*1) + 8 bytes (Preamble/SFD*2) + packet size (bytes)) x 8 bits)   (1)
Note: *1 IFG: Inter Frame Gap, *2 SFD: Start Frame Delimiter
Throughput per router (pps) = (packet arrival rate per port) x (number of ports)               (2)
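Equations (1) and (2) can be checked with a short Python sketch (the function names are my own):

```python
def arrival_rate_pps(packet_bytes: int, link_bps: float = 100e6,
                     ifg_bytes: int = 12, preamble_bytes: int = 8) -> float:
    """Equation (1): per-port packet arrival rate in packets per second."""
    wire_bits = (ifg_bytes + preamble_bytes + packet_bytes) * 8
    return link_bps / wire_bits

def router_throughput_pps(packet_bytes: int, ports: int = 16) -> float:
    """Equation (2): aggregate throughput of the 16-port router."""
    return arrival_rate_pps(packet_bytes) * ports
```

For 64-byte packets this reproduces the 0.1488 Mpps per port and 2.38 Mpps per router used in the text.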
Suppose the router receives 64-byte packets at 100Mbps on each port and forwards all of them to output ports; the ideal throughput is then 0.1488 Mpps (arrival rate per port) x 16 ports = 2.38 Mpps. The throughputs of the other workloads can be calculated likewise. The simulation also assumes that a packet is always ready at each port, with no idle time. Figure 5-12 presents the simulation results together with the ideal simulation rates and the theoretical OC-24 pps (packets per second) values for the four workloads. Since the physical bandwidth of OC-24 (1.24Gbps) is close to the simulation bandwidth (16 x 100Mbps = 1.6Gbps), I include the theoretical OC-24 values in the graph. As seen, there is no difference between the simulated and ideal throughput rates for the three fixed-size workloads. The mixture workload shows a slight gap: even though I used the average packet size of 406 bytes in equations (1) and (2) to calculate the ideal throughput of the mixture workload, that value cannot be precise because three differently sized packets are sent randomly; alternatively, the mixture workload may simply place a heavier load on the IXP1200 than a fixed-size workload.
Workload   Sim Rate   Ideal Sim Rate   OC-24 (CRC16)
Mixture    0.40       0.47             0.38
1518B      0.13       0.13             0.10
594B       0.33       0.33             0.26
64B        2.38       2.38             2.83
(all values in Mpps)
Figure 5-12. Throughputs (bounded)
In addition, the graph shows that the theoretical throughput of OC-24 is higher than the simulated value for the 64-byte packet workload even though its physical bandwidth is lower. The reason is that the protocol overhead of Ethernet (38 bytes) is much larger than that of OC-24 POS (Packet over SONET) (7 bytes): 38 bytes represents 82.6% overhead and 7 bytes represents 15.2% overhead for a 46-byte IP packet. In general, then, the real throughput of IP packets depends not only on the router's processing ability but also on the media and protocol overhead, because different media and protocols such as Ethernet, SONET, and ATM have different overhead sizes. Appendix F explains the theoretical throughputs of IP packets under different encapsulations.
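The overhead percentages quoted above can be verified directly (a quick check; the function name is my own):

```python
def overhead_pct(overhead_bytes: int, ip_bytes: int = 46) -> float:
    """Protocol overhead relative to a minimum 46-byte IP packet."""
    return 100.0 * overhead_bytes / ip_bytes

ethernet = overhead_pct(38)   # Ethernet framing overhead: 82.6%
pos = overhead_pct(7)         # OC-24 POS framing overhead: 15.2%
```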
Regarding NP performance evaluation, the simulation results cannot be compared with such values directly. Nevertheless, the processing ability of the IXP1200 should be sufficient to forward packets at OC-24 speed: if 64-byte Ethernet-formatted packets were carried over an OC-24-class link, the throughput would be 1.85 Mpps, which is lower than the simulation result.
To find out how fast a packet stream the IXP1200 can process, I simulated the four workloads under unbounded execution. The IXP1200 simulator offers two execution conditions, "bounded" and "unbounded". Bounded execution assumes a real system environment and was used to collect the data of Figure 5-12: data is received from and transmitted to the network at the specified data rate with an inter-frame gap (IFG). Even if the processing capability of the IXP1200 exceeds the wire speed, the throughput converges to the wire speed.
Unbounded execution, on the other hand, evaluates the maximum packet processing capability of the IXP1200 at effectively infinite wire speed: the simulator always has data ready to be received by the IXP1200, without an IFG, and always has the ports ready to receive data from the IXP1200. The simulation thus acts as if data is coming from and going to the network at infinite speed, bypassing the receive and transmit buffers in the MAC devices. The throughputs under unbounded execution are shown in Figure 5-13. The graph includes theoretical throughputs based on the assumption that Ethernet packets flow at 1.244Gbps (OC-24 class) and 2.488Gbps (OC-48 class) with no IFG, although these are not real protocols. As a consequence, the IXP1200 achieves from 70% to 88% of the theoretical OC-48-class throughput. Even though the IXP1200 is not quite able to forward OC-48 packets at wire rate, the simulation results imply that the processing ability of a single NP is clearly approaching the OC-48 class.
Workload   Sim Rate   1.244G Ether (OC-24 class)   2.488G Ether (OC-48 class)
Mixture    0.58       0.38                         0.75
1518B      0.15       0.10                         0.20
594B       0.46       0.26                         0.52
64B        3.07       2.16                         4.32
(all values in Mpps)
Note: These throughputs do not include the 12-byte IFG overhead.
Figure 5-13. Throughputs (unbounded)
6. Other Network Processors
The microarchitecture of network processors is marked by an emphasis on streaming data throughput and heavy use of architectural parallelism. Chip multiprocessing combined with hardware multithreading is a popular technique for exploiting the huge thread-level parallelism available in packet processing workloads. In fact, most NPs support fast context switching, as the IXP1200 does. For example, the IQ2000 has four 32-bit scalar cores with 64-bit memory interfaces, so each core can perform a double-word load or store in a single clock cycle [2]. Each core has five identical register files (32 registers, 32 bits wide, triple-ported), allowing it to run five concurrent threads of execution with fast context switching. In addition, Xstream Logic's network processor is based on the dynamic multistreaming (DMS) technique, also known as simultaneous multithreading [14]. The processor core supports eight threads, each with its own instruction queue and register file, divided into two clusters of four threads each. Every clock cycle, each cluster can issue up to 16 instructions, four from each thread, of which four are selected and dispatched to the four functional units in that cluster for execution. The DMS core has 9 pipeline stages and features a MIPS-like ISA.
This section characterizes other network processors, focusing in particular on their instruction sets, context switching, and branching. Three popular network processors with different fast context switch features are introduced: Lexra's NetVortex [15], Motorola's C-Port [3], and IBM's PowerNP [4].
6.1 Lexra’s NetVortex
Lexra's NetVortex is based on the 32-bit MIPS-1 architecture and allows up to 8 contexts per processor. Each context includes 32 general registers r0-r31, its own program counter (CXPC), and a status register (CXSTATUS). The status register lets the program set the I/O and software events on which the thread will wait, as well as the thread's priority. NetVortex can execute 18 extended instructions, listed in Table 6-1, in addition to the general MIPS-1 instructions (the MIPS-1 instruction set is shown in Appendix G). Six of these instructions support context switching among threads, hiding much of the latency of memory loads and stores. The instruction set also includes new bit-field instructions that make it easier to parse packet headers.
Figure 6-1 depicts the fast context switch mechanism of NetVortex. The LW.CSW instruction (load word with context switch) is the second instruction in thread 1. Like the MIPS architecture, NetVortex provides a delay slot after a branch or memory reference and allows one more instruction to execute while the processor is fetching data; NetVortex switches context to the next available thread at the delay slot. The context program counter CXPC then points to the fourth instruction, i.e., the instruction after the delay slot, and the wait status is set in CXSTATUS. When thread 2 begins to run, its CXPC is copied into PC, the global program counter containing the currently running instruction, and its CXSTATUS is set to active. When thread 2 encounters its next context-switch instruction, the whole procedure repeats. When the outstanding memory reference completes, the CXSTATUS of the waiting thread changes from wait to ready, and the thread becomes available again.
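The CXPC/CXSTATUS handoff described above can be sketched as a tiny scheduler model (all names and the two-thread scenario are illustrative, not Lexra's actual microarchitecture):

```python
# Toy model of NetVortex-style switching: LW.CSW saves the resuming PC in
# CXPC, marks the thread Wait, and hands the CPU to the next Ready thread.
class Thread:
    def __init__(self, name):
        self.name, self.cxpc, self.status = name, 0, "Ready"

def lw_csw(current, other_threads, resume_pc):
    current.cxpc, current.status = resume_pc, "Wait"   # park after delay slot
    nxt = next(t for t in other_threads if t.status == "Ready")
    nxt.status = "Active"                              # its CXPC becomes the live PC
    return nxt

def load_complete(thread):
    thread.status = "Ready"                            # memory reference finished

t1, t2 = Thread("T1"), Thread("T2")
t1.status = "Active"
running = lw_csw(t1, [t2], resume_pc=4)   # T1 issues LW.CSW; resume at I4
load_complete(t1)                         # later: T1's load word returns
```

If no other thread is Ready, the `next(...)` lookup fails, which corresponds to the pipeline stall NetVortex takes in that case.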
Unlike the conditional branches of MIPS-1, which the CPU must resolve in the execute stage, all context switch instructions execute unconditionally. The CPU discovers a context-switch instruction in the decode stage, as the IXP1200 does, and always executes the following instruction (in the delay slot), like the deferred option of the IXP1200, in order to avoid creating pipeline bubbles. Consequently, programmer skill or compiler optimization is important for filling the delay slot and avoiding context switch penalty cycles, just as on the IXP1200. In NetVortex, the CPU stalls the pipeline only if no other thread is ready to resume execution.
Table 6-1. NetVortex Extended Instruction Set

Context-Control Instructions
  MYCX       Read my context
  POSTCX     Post event to a context
  CSW        Context switch
  LW.CSW     Load word with context switch
  LT.CSW     Load twinword* with context switch
  WD         Write descriptor to device
  WD.CSW     Write descriptor to device with context switch
  WDLW.CSW   Write descriptor to device, load word with context switch
  WDLT.CSW   Write descriptor to device, load twinword with context switch

Bit-Field Instructions
  SETI       Set subfield to ones
  CLRI       Clear subfield to zeroes
  EXTIV      Extract subfield and prepare for insertion
  INSV       Insert extracted subfield
  ACS2       Dual 16-bit ones-complement add for checksum

Cross-Context Access Instructions
  MFCXG      Move from a context general-purpose register
  MTCXG      Move to a context general-purpose register
  MFCXC      Move from a context-control register
  MTCXC      Move to a context-control register

Note: Twin words are 64-bit values.
[Figure: two thread contexts (r0-r31 each). Thread 1 executes LW.CSW and its delay slot; Thread1 CXPC = I4(T1) and CXSTATUS = Wait; control switches to Thread 2, whose CXPC becomes the live PC and whose CXSTATUS becomes Active.]
Figure 6-1. NetVortex Context Switch Mechanism
6.2 Motorola's C-5
Motorola's C-5 has 16 dedicated, programmable Channel Processors (CPs) for
packet forwarding. Each CP consists of a Serial Data Processor (SDP), which
contains microcode-programmable components for receive and transmit processing,
and a Channel Processor RISC Core (CPRC), which performs packet processing with
special-purpose instruction and data memory. The CPRC supports scheduling and
characterizing packets, table lookup, and making forwarding and filtering
decisions. The CPRC implements a subset of the MIPS-1 instruction set (excluding
multiply, divide, floating point, and Coprocessor Zero (CP0) instructions); refer
to Appendix G. Even though the standard MIPS CP0 instructions are not supported,
the C-5 provides its own special-purpose Coprocessor Zero registers, shown in Table 6-2.
The CPRC instructions can be classified into five groups: 1) load and store,
2) arithmetic and logical, 3) jump and branch, 4) coprocessor interface, and
5) miscellaneous.
To multiplex processing among a number of different tasks, the CPRC
incorporates four sets of 32 internal registers and performs a context switch
under software control or hardware interrupt. Therefore, the C-5 can provide
16 x 4 = 64 threads in total for packet processing. Execution resumes on a
different context within two cycles.
As described, context switching on the C-5 works in two ways: software
control or hardware interrupt. In the software mechanism, context switching is
executed by the C-5's own Coprocessor Zero instructions. For example, MTC0 $1 $3
(where $1 specifies the destination context and $3 is the source, or current,
context) switches from context $3 to context $1. These contexts have no priority.
In the hardware mechanism, the CPRC uses prioritized hardware interrupts that
can be triggered from any bit in two event registers. Hardware interrupts
employ a special-purpose register (K1) containing the program counter value and
the context number of the interrupted context. First, all interrupts are
disabled until a restore-from-exception (RFE) instruction is executed. In the
interrupted context, the address of the next instruction to execute is saved in
K1. When RFE is executed, the program flow returns to the previously interrupted
context. Thus, context switching is performed.
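The interrupt/RFE sequence just described can be made concrete with a small Python model. This is a toy, not the real C-5: register contexts are reduced to per-context program counters, and the class and method names are mine.

```python
class CPRCModel:
    """Toy model of C-5-style context switching: four register contexts,
    a software switch (MTC0-style), and a hardware interrupt that saves
    the return PC and context number in K1 until RFE restores them."""
    def __init__(self):
        self.pc = {c: 0 for c in range(4)}   # per-context program counters
        self.current = 0
        self.k1 = None                       # (saved_pc, saved_context)
        self.interrupts_enabled = True

    def mtc0_switch(self, dest: int) -> None:
        # software mechanism: MTC0 $dest $current switches contexts directly
        self.current = dest

    def interrupt(self, handler_context: int) -> None:
        if not self.interrupts_enabled:      # masked until RFE executes
            return
        self.interrupts_enabled = False
        # save the address of the next instruction and the interrupted context
        self.k1 = (self.pc[self.current] + 1, self.current)
        self.current = handler_context

    def rfe(self) -> None:
        # restore-from-exception: return to the interrupted context
        saved_pc, saved_ctx = self.k1
        self.pc[saved_ctx] = saved_pc
        self.current = saved_ctx
        self.interrupts_enabled = True
```

The key point the model captures is that K1 holds both pieces of resume state, so the handler needs no stack to return.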
Table 6-2. C-5 Coprocessor Zero Register Definitions
Register Definition
R0 Whoami Register – Contains the DMEM base (hardcoded) for this CPRC
R1 Interrupt Table Register – Contains the vector address for INT 0
R2 Break Table Register – Contains the vector address for break 0
R3 Current Context Register – The two LSBs are the current context register
R4 DMEM Comparison Address – Contains the address at which debug pulse is generated
R5 DMEM Comparison Address Mask – Contains the mask for the DMEM address
R6 DMEM Comparison Data – Contains the data value for which debug pulse is generated
R7 DMEM Comparison Data Mask – Contains the mask for the DMEM data
R8 Interrupt Flag – The LSB in the Interrupt Flag
R9 Read/Write Mask – The two LSBs are the Read mask and the Write mask for R4 to R7
6.3 IBM's PowerNP
IBM's PowerNP integrates 16 32-bit picoprocessors with one PowerPC core on a
single chip. Each picoprocessor supports two hardware threads, and each thread
has 16 32-bit (or 32 16-bit) General-Purpose Registers (GPRs). Two picoprocessors
are packed into a Dyadic Protocol Processor Unit (DPPU) and share eight
coprocessors, such as a tree search engine, semaphore, checksum, data store, and
so on. Four threads perform context switching within a cluster.
Each picoprocessor has a one-cycle ALU shared by two threads and performs
packet processing through the core instruction set, namely its operation codes
(opcodes). The opcodes fall into four categories: 1) ALU opcodes, 2) control
opcodes, 3) data movement opcodes, and 4) coprocessor execution opcodes. ALU
opcodes are categorized into six types: 1) arithmetic immediate, 2) logical
immediate, 3) compare immediate, 4) load immediate, 5) arithmetic/logical
register, and 6) count leading zeros. The conditional branch operations among the
control opcodes depend on condition codes. All opcodes and condition codes are
presented in Appendix G.
Context switching occurs when the picoprocessor is waiting for a shared
resource (for example, waiting for one of the coprocessors to complete an
operation, return the results of a search, or access DRAM). Basically, context
switching is handled by the coprocessor execution opcodes. Figure 6-2 shows the
pseudo code of the wait opcode as an example of a coprocessor execution opcode.
The wait opcode synchronizes one or more coprocessors. The mask16 field is a bit
mask (one bit per coprocessor) in which the bit number corresponds to the
coprocessor number. The thread stalls until all coprocessors indicated by the
mask complete their operations. Priority can also be released with this command.
Context switching incurs no overhead between threads. Since the eight
coprocessors perform the primary router processing and reduce the path length of
the picoprocessor, there should be a big advantage for parallel processing.
IF Reduction_OR(mask16(i) AND coprocessor.Busy(i)) THEN
PC <= stall
ELSE
PC <= PC + 1
END IF
IF p = 1 THEN
PriorityOwner(other thread) <= TRUE
ELSE
PriorityOwner(other thread) <= PriorityOwner(other thread)
END IF
Figure 6-2. Coprocessor Execution Opcode Example (Wait Opcode)
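As a minimal rendering of Figure 6-2's semantics, the wait opcode's per-cycle evaluation can be written in Python (names are mine; the hardware evaluates this condition every cycle until the stall clears):

```python
def wait_step(pc, mask16, busy, p, priority_owner):
    """One evaluation of the wait opcode: stall (PC unchanged) while any
    coprocessor selected by mask16 still reports busy; otherwise advance
    the PC and, if the p bit is set, hand priority to the other thread."""
    if mask16 & busy:                  # reduction-OR over masked busy bits
        return pc, priority_owner      # stall: PC does not advance
    if p == 1:
        priority_owner = True          # release priority to the other thread
    return pc + 1, priority_owner
```

The bitwise AND followed by a truth test is exactly the Reduction_OR over per-coprocessor (mask, busy) pairs in the pseudo code.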
7. Conclusions and Future Work
This paper has addressed the characterization of router functions and
workloads for the evaluation of Network Processors (NPs). Based on four proposed
workloads (64 bytes, 594 bytes, 1518 bytes, and a mixture), the analytical data
of the IXP1200 has revealed some critical features of the microarchitecture
associated with router processing and its performance. It has shown that
"arithmetic, rotate, and shift" and "branch and jump" instructions occupy a high
proportion of the instruction mix. Even though the ratio does not differ
markedly across the different packet-size workloads, the simulation results show
that the proportion depends slightly on them. Especially in the receive thread,
as the packet size increases, the ratios of "arithmetic, rotate, and shift" and
"local register" instructions increase due to the frequency of packet header
processing. In addition, the simulation has shown that the IXP1200 almost
completely hides the huge latencies of memory reference instructions with fast
context switching. However, another critical issue has come up: the number of
cycles aborted by branches and context switches is not small, and the issue
cannot be left alone. To reduce those cycles effectively and improve
performance, some dynamic hardware prediction or speculation could be necessary
for future NPs, or alternatively optimization in the assembler and compiler.
Since NPs generally include a number of RISC cores and other network processing
components, it could be expensive to apply such techniques. Therefore, when
considering the use of prediction or speculation techniques, it would be
necessary to keep any prediction buffer or history table as small as possible.
Thus, there seems to be room for technological improvement in NPs' hardware
context switching and branching.
In addition, this paper demonstrated that the CPI increases as the size of
the data packets increases, particularly in the receive thread, because a large
packet needs more time to transfer even though the overhead for header and
lookup processing stays constant. Besides, the mixture workload seems to be
dominated by the 1518-byte packets, because its result is always close to that
workload even though its average packet size is less than 594 bytes.
In the bounded throughput evaluation, the IXP1200 succeeded in achieving the
ideal throughput of 2.38 Mpps on the basis of minimum-sized packets. In the
unbounded evaluation, the IXP1200 achieved 3.07 Mpps, which constitutes
approximately 71% of the theoretical throughput based on the hypothetical use of
Ethernet on an OC-48 class physical line. In conclusion, a single NP's
processing ability is enough for OC-24 but still not enough to accomplish OC-48
at wire speed. However, since its ability is clearly approaching that level, one
NP should accomplish OC-48 and more in the near future.
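The wire-speed arithmetic behind these throughput figures is a short calculation. The sketch below assumes standard Ethernet framing overhead (7-byte preamble, 1-byte SFD, and 12-byte inter-frame gap, 20 bytes per frame); the thesis's 71% figure may rest on its own framing assumptions, so this is illustrative rather than a reproduction of its numbers:

```python
def max_pps(line_rate_bps: float, frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Theoretical packets per second on an Ethernet-framed line, charging
    each frame for preamble + SFD + inter-frame gap (20 bytes by default)."""
    return line_rate_bps / ((frame_bytes + overhead_bytes) * 8)

OC48 = 2.488e9  # OC-48 payload line rate in bits per second
# Minimum-sized (64-byte) frames on OC-48:
print(round(max_pps(OC48, 64) / 1e6, 2))  # prints 3.7
```

So roughly 3.7 Mpps of 64-byte frames saturates an OC-48 line under these assumptions, which puts the measured 3.07 Mpps in context.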
8. Bibliography
[1] Intel Corporation. IXP1200 Network Processor Datasheet, December 2001
[2] T. Halfhill. Sitera Samples Its First NPU. Microprocessor Report, May 2000
[3] C-Port Corporation (Motorola). C-5 Network Processor Architecture Guide, May
2001
[4] IBM Microelectronics Division. Power NP NP4GS3 Network Processor
Datasheet, February 2002
[5] Internet backbone maps. http://www.nthelp.com/maps.htm
[6] ISP world. http://www.boardwatch.com/isp/bb/Backbone_Profiles.htm
[7] Cable & Wireless Global Internet backbone. http://www.sla.cw.net/sla/index.jsp
[8] Howard C. Berkowitz. Designing Routing and Switching Architecture for
Enterprise Networks, 1999
[9] Larry L. Peterson and Bruce S. Davie. Computer Networks Second Edition,
2000
[10] San Diego Super Computer Center. http://www.sdsc.edu/
[11] Agilent Technologies. http://advanced.comms.agilent.com/routertester/
[12] National Laboratory for Applied Network Research (NLANR), Measurement &
Operations Analysis Team. http://moat.nlanr.net/Datacube
[13] RFC879. http://www.faqs.org/rfcs/rfc879.html
[14] Linda Geppert. The New Chips on the Block. IEEE Spectrum, January 2001
[15] Tom R. Halfhill. Lexra’s NetVortex Does Networking. Microprocessor Report,
July 2000
[16] Tom R. Halfhill. Intel Network Processor Targets Routers. Microprocessor Report,
September 1999
[17] Intel Corporation. IXP1200 Network Processor Family Microcode Programmer’s
Reference Manual, December 2001
[18] Intel Corporation. IXP1200 Network Processor Family Hardware Reference
Manual, December 2001
[19] Intel Corporation. IXP1200 Network Processor Family Development Tools
User’s Guide, December 2001
[20] David A. Patterson and John L. Hennessy. Computer Architecture A
Quantitative Approach Second Edition, 1996
[21] David A. Patterson and John L. Hennessy. Computer Organization & Design,
1998
[22] Tammo Spalink, Scott Karlin, Larry Peterson, Yitzchak Gottlieb. Building a
Robust Software-Based Router Using Network Processors. 18th ACM Symposium
on Operating Systems Principles (SOSP '01), pages 216–229, October 2001
[23] Vitesse Semiconductor Corporation, Samuel J. Barnett and Mark R. Fauber.
Network Processors: Uncovering Architectural Approaches for High-Speed
Packet Processing, 2000
[24] Vitesse Semiconductor Corporation. IQ2000 Network Processor Product Brief,
2000
[25] K. Krewell. Agere’s Pipelined Dream Chip. Microprocessor Report, June 2000
[26] Tom R. Halfhill. Alliance Detours Into Routers. Microprocessor Report,
August 1999
Appendix A: Pseudo Code
//**************************** Format of RCV_RDY_LO **************************
rr: Receive Ready Flags corresponding to each port (one flag per bit, bits 31:0)
//****************************************************************************
receive_ready_check()
{
check_port:
recrdy_inflight_blocked:
if (@recrdy_inflight == SEMAPHORE_OPEN)
goto set_rec_ready
else
goto recrdy_inflight_blocked
end if
set_rec_ready:
@recrdy_inflight = SEMAPHORE_CLOSE
copy $rec_rdy <- CSR_RCV_RDY_LO
@recrdy_inflight = SEMAPHORE_OPEN
if (0 != (1 & ($rec_rdy >> rec_req(lower 5bits))))
goto receive_request
else
goto check_port
end if
receive_request:
return
}
Figure A-1. Receive Ready Check
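The semaphore open/close pair around the CSR copy in Figure A-1 is a mutual-exclusion pattern: only one thread at a time may have a RCV_RDY_LO read in flight. In a conventional threaded language the same busy-wait discipline maps onto a lock; the sketch below is illustrative (the lock name and the CSR stand-in value are mine):

```python
import threading

recrdy_inflight = threading.Lock()   # plays the role of @recrdy_inflight
CSR_RCV_RDY_LO = 0b1010              # stand-in for the hardware register

def receive_ready_check(port: int) -> bool:
    """Serialize the CSR read (SEMAPHORE_CLOSE ... SEMAPHORE_OPEN in the
    pseudo code), then test this port's receive-ready bit."""
    with recrdy_inflight:            # acquire = close, release = open
        rec_rdy = CSR_RCV_RDY_LO     # copy the CSR while holding the lock
    return bool((rec_rdy >> port) & 1)
```

Note that only the CSR copy is inside the critical section; the bit test runs on the private copy, keeping the lock hold time minimal, just as the pseudo code reopens the semaphore immediately after the copy.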
//**************************** Format of RCV_REQ **************************************
31  30:29 28:27 26 25:22 21:18 17:16 15  14  13   12  11    10:6 5:3 2:0
RES FA    TSMG  SL E2    E1    FS    NFE RES IGFR RES SIGRS TID  RM  RP
RES: Reserved, FA: Maximum IX Bus Accesses, TSMG: Thread Message, SL: Status Length,
E2: Element 2, E1: Element 1, FS: Fast/Slow port mode, NFE: Number of FIFO Elements,
IGFR: Ignore Fast Ready Flag, SIGRS: Signal Receive Scheduler, TID: Thread ID,
RM: Receive MAC, RP: Receive Port
//*************************************************************************************
receive_request()
{
port_rx_init(rec_req, rfifo_addr) // create canned receive_request
req_inflight_check:
if (@req_inflight == SEMAPHORE_OPEN)
goto set_rec_req
else
goto req_inflight_check
end if
set_rec_req:
@req_inflight = SEMAPHORE_CLOSE
$rec_csr = rec_req
copy $rec_csr -> CSR_RCV_REQ
return
}
Figure A-2. Receive Request Issue
//**************************** Format of RCV_CTL *********************************
THMSG: Thread Message, MACPORTTHD: MAC Port Number/Header Thread ID,
SOPSEQ: Start of Packet Sequence Number, RF: Receive Fail, RERR: Receive Error,
SE: Second Element, FE: First Element, EF: Element Filled, SN: Sequence Number,
VLDBytes: Valid Bytes, EOP: End of Packet, SOP: Start of Packet
//********************************************************************************
receive_status()
{
signal_receive:
wait_start_receive()
copy $rec_csr <- CSR_RCV_CNTL
if ( ($rec_csr & 1) >0)
goto sop
else
exception = 0x3 & ($rec_csr >>18) // save RF and RERR (RCV_CNTL[19:18])
rec_state = (rec_state, byte_enable(1101)) + (($rec_csr << 8), byte_enable(0010))
// save VLDBytes, EOP, SOP (RCV_CNTL[7:0])
goto done
end if
sop:
rec_state = 0 + (($rec_csr << 8), byte_enable(0010)) // initialize VLDBytes, EOP, SOP (RCV_CNTL[7:0])
done:
return
}
Figure A-3. Receive Packet Status Acquisition
pkbuf_allocate()
{
xbuf_alloc($pop_xfer, 1 lword)
$pop_xfer [0] = buf_pop(PKBUFF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)
while ($pop_xfer [0] == SRAM_DESC_BASE)
$pop_xfer [0] = buf_pop(PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)
end while
pkbuf_addr = cal_pkbuf_addr($pop_xfer [0],PKBUF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)
desc_addr = $pop_xfer
xbuf_free($pop_xfer)
}
Figure A-4. Packet Buffer Allocation
port_rx_fail_error_check()
{
if (exception == PORT_RXFAIL)
inc_rx_fail_count_and_total_discard()
// increment exception counter and behave like EOP, except that the packet is not queued
pkbuf_addr = cal_pkbuf_addr(desc_addr,PKBUFF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)
@req_inflight = SEMAPHORE_OPEN
continue
else if (exception == PORT_RXERROR)
inc_rx_error_count() // increment exception counter
pkbuf_addr = cal_pkbuf_addr(desc_addr,PKBUFF_BASE,PKBUF_SIZE,DESC_BASE,DESC_SIZE)
@req_inflight = SEMAPHORE_OPEN
continue
end if
@req_inflight = SEMAPHORE_OPEN
return
}
Figure A-5. Port Fail/Error Check
//********************* Format of Ethernet/802.3 ***************
// Long word 0: Destination Address [0:31]
// Long word 1: Destination Address [32:47] | Source Address [0:15]
// Long word 2: Source Address [16:47]
// Long word 3: EtherType/Length [0:15] | Data*/LLC
// Long words 4-14 (min)*: Data*/LLC
// Long word 15 (min)*: FCS [0:31]
// *Note: min 46 bytes and max 1500 bytes of data in Ethernet
//***************************************************************
get_mpkt_header()
{
xbuf_alloc($pkt_buf, 4 lwords) // $pkt_buf[0]-[3] for 16bytes header
xbuf_alloc($pkt_buf_eth , 2 lwords) // for SNAP header
xbuf_link($pkt_buf, $pkt_buf_eth )
copy $pkt_buf <- RFIFO(addr(rfifo_addr + QWOFFSET0), size(3quadwords))
#if little endian
sa01 = $pkt_buf [1] >> 16
#else
sa01 = 0 + $pkt_buf[1](LS16bit) // for later merge
#end if
extract proto_len <- $pkt_buf [3](addr(BYTEOFFSET0 + 12), size(2bytes))
return
}
Figure A-6. MAC Packet Header Acquisition
parse_packet()
{
ethertype = 0
if proto_len < 1500 // 802.3(length)
extract eth_llc1 <- $pkt_buf (addr(BYTEOFFSET0 + 14), size(1byte))
extract eth_llc2 <- $pkt_buf (addr(BYTEOFFSET0 + 15), size(1byte))
eth_llc1 = eth_llc1 & eth_llc2
extract eth_llc2 <- $pkt_buf (addr(BYTEOFFSET0 + 16), size(1byte))
if eth_llc1 == 0xAA
if eth_llc2 == 0x03
extract ethertype <- $pkt_buf (addr(BYTEOFFSET0 + 17), size(3bytes))
end if
end if
if ethertype > 0
pkstate = pkstate | (TRUE << SHIFT_PKTLINKTYPE_LLCSNAP)
else
pkstate = pkstate | (TRUE << SHIFT_PKTLINKTYPE_LLC)
end if
else // Ethernet(type)
pkstate = pkstate | (TRUE << SHIFT_PKTLINKTYPE_ETHERNET)
ethertype = proto_len
end if
xbuf_free($pkt_buf_eth )
return
}
Figure A-7. Parse Packet
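The branching in parse_packet reduces to one decision and one pattern check: a small EtherType/Length value means an 802.3 length field, and the DSAP/SSAP/CTL bytes 0xAA 0xAA 0x03 then mark LLC/SNAP. A compact Python restatement (the 1500 threshold follows the pseudo code above; function and label names are mine):

```python
def classify_link_type(type_or_length: int, llc: bytes) -> str:
    """Classify a frame the way parse_packet does: values of 1500 or less in
    the EtherType/Length field are an 802.3 length (LLC, or LLC/SNAP when the
    header starts 0xAA 0xAA 0x03); larger values are an Ethernet II EtherType."""
    if type_or_length <= 1500:
        if len(llc) >= 3 and llc[0] == 0xAA and llc[1] == 0xAA and llc[2] == 0x03:
            return "LLC/SNAP"
        return "LLC"
    return "ETHERNET"
```

The pseudo code's AND of the first two LLC bytes before comparing with 0xAA is an instruction-saving trick for the same DSAP == SSAP == 0xAA test made explicit here.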
ethertype_classifier()
{
if (ethertype == 0x0800) // Internet Protocol(IP)
IP_forwarder ()
else if (ethertype == 0x0805) // X.25
X25_forwarder ()
else if (ethertype == 0x0806) // Address Resolution Protocol(ARP)
ARP_forwarder ()
else if (ethertype == 0x8137) // IPX
IPX_forwarder ()
else if (ethertype == 0x809B) // Appletalk over Ethernet
Appletalk_forwarder ()
end if
return
}
Note: This code is not included in L2L3fwd16
Figure A-8. Ethertype Field Classifier
ether_filter(ethertype,$pkt_buf)
{
pkt_state = 0, pkaction = 0
rec_port_num = rec_req & 0x1F
ether_port_info(rec_port_num )
xbuf_alloc($$dxfer, 8)
xbuf_link($$dxfer, $$dxfer)
xbuf_alloc($hash_buf, 4 lwords)
//in_port_filter_type - L2 filtering options
// 00 Ethertype based filtering
// 01 Explicit rule, action specified in SDRAM fwd entry
// 10 Positive filtering, action implied by presence/absence of filter rule
// 11 Negative filtering, action implied by presence/absence of filter rule
// setup for da hash
extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 0), size(2bytes)) // load DA bytes 0-1
$hash_buf [0] = ethfilt_tempa
extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 2), size(4bytes)) // load DA bytes 2-5
$hash_buf [1] = ethfilt_tempa
extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 6), size(2bytes)) // load SA bytes 0-1
$hash_buf [2] = ethfilt_tempa
extract ethfilt_tempa <- $pkt_buf (addr(BYTEOFFSET0 + 8), size(4bytes)) // load SA bytes 2-5
$hash_buf [3] = ethfilt_tempa
hash2_48($hash_buf) // two 48-bit Hash operation for DA and SA
hash0 = $hash_buf [0]
hash1 = $hash_buf [1]
hash2 = $hash_buf [2]
hash3 = $hash_buf [3]
// Hash Table Level 1 lookup
ethfilt_tempb = SRAM_L1_ADDR_HASH_BASE
ethfilt_tempa = 0 + (hash1, byte_enable(0011)) // table index
copy $hash_buf [0 ] <- SRAM(addr(ethfilt_tempa + ethfilt_tempb), size(1longword)) // lookup DA
ethfilt_tempa = 0 + (hash3, byte_enable(0011)) // table index
copy $hash_buf [1] <- SRAM(addr(ethfilt_tempa + ethfilt_tempb), size(1longword))// lookup SA
da_lookup_result = $hash_buf [0] // check results of DA lookup
sa_lookup_result = $hash_buf [1] // check results of SA lookup
xbuf_free($hash_buf)
// Hash Table Level2 lookup
ethfilt_tempb = 0x1 & (da_lookup_result >> 31) // Check for collision bit
if (ethfilt_tempb == 1)
da_lookup_result = hash_resolve(hash0,hash1,SRAM_L2_ADDR_HASH_BASE) // get da_lookup_result on L2 lookup
end if
ethfilt_tempb = 0x1 & (sa_lookup_result >> 31) // Check for collision bit
if (ethfilt_tempb == 1)
sa_lookup_result = hash_resolve(hash2,hash3,SRAM_L2_ADDR_HASH_BASE) // get sa_lookup_result on L2 lookup
end if
end if
if (da_lookup_result == 0) // MAC entry does not exist
da_port_num = DEST_PORT_NO_MATCH
else
ethfilt_tempb = 0xfffffff & da_lookup_result // SDRAM index
da_lookup_result = ethfilt_tempb
copy $$dxfer <- SDRAM(addr(0 + da_lookup_result), 2 quadwords) // read the forwarding table for DA
ethfilt_tempa = 0
ethfilt_tempa = FORWARD_ENTRY_MASK & ($$dxfer >> FORWARD_ENTRY_SHIFT_SIZE)
if (ethfilt_tempa) // If forwarding information is associated with this entry
da_port_num = $$dxfer[0] & 0x1F // isolate DA port number from forwarding table
extract (dst_port_entry) <- ether_port_info(da_port_num ) // Get port information of the destination port
end if
end if
src_port_ethertype = 0 + ((src_port_entry >> 8), byte_enable(0011))
dst_port_ethertype = 0 + ((dst_port_entry >> 8), byte_enable(0011))
src_port_filtertype = 0x03 & (src_port_entry >> 4)
dst_port_filtertype = 0x03 & (dst_port_entry >> 4)
pkaction = PKT_PERMIT // Default action
if (((src_port_filtertype == FILT_TYPE_ETHERTYPE) || (dst_port_filtertype == FILT_TYPE_ETHERTYPE)) &&
(da_port_num != DEST_PORT_NO_MATCH))
ethfilt_tempa = ethertype & src_port_ethertype
if (ethfilt_tempa != dst_port_ethertype)
pkaction = PKT_DENY
goto filter_return
end if
end if
if ((src_port_filtertype == FILT_TYPE_EXPLICT) && sa_lookup_result) // SA filter
ethfilt_tempb = 0xfffffff & sa_lookup_result // SDRAM index
sa_lookup_result = ethfilt_tempb
copy $$dxfer <- SDRAM(addr(0 + sa_lookup_result), size(2quadwords)) // read the forwarding table for SA (4 longwords)
pkaction = BR_FILTER_ACTION_MASK & ($$dxfer >> FILTER_ACTION_SA_SHIFT_SIZE)
if (pkaction == PKT_DENY)
goto filter_return
end if
end if
if ((dst_port_filtertype == FILT_TYPE_EXPLICT) && da_lookup_result) // DA filter
pkaction = BR_FILTER_ACTION_MASK & ($$dxfer >> FILTER_ACTION_DA_SHIFT_SIZE)
if (pkaction == PKT_DENY)
goto filter_return
end if
end if
if ((src_port_filtertype == FILT_TYPE_POSITIVE) || (dst_port_filtertype == FILT_TYPE_POSITIVE))
// Positive filtering - default action is to permit. SA/DA entry in the table will be dropped.
//DA filter
if (da_lookup_result)
pkaction = PKT_DENY
goto filter_return
end if
// SA filter
if (sa_lookup_result)
pkaction = PKT_DENY
goto filter_return
end if
end if
if ((src_port_filtertype == FILT_TYPE_NEGATIVE) || (dst_port_filtertype == FILT_TYPE_NEGATIVE))
// Negative filtering - default action is to deny. SA/DA entry in the table will be allowed.
// DA filter
if (!da_lookup_result)
pkaction = PKT_DENY
goto filter_return
end if
// SA filter
if (!sa_lookup_result)
pkaction = PKT_DENY
goto filter_return
end if
end if
filter_return:
}
Note: This filter pseudo code includes Layer2 MAC Protocol filtering and/or Bridging
Figure A-9. Filter
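The positive- and negative-filtering branches at the end of the filter code implement two default policies that are mirror images of each other. A Python distillation of just that decision (constant and function names are mine, not from the microcode):

```python
PKT_PERMIT, PKT_DENY = "permit", "deny"

def l2_filter_action(filter_type: str, da_hit: bool, sa_hit: bool) -> str:
    """Decision logic of the positive/negative filter branches above:
    positive filtering permits by default and drops any frame whose DA or SA
    is present in the table; negative filtering denies by default and permits
    a frame only when both DA and SA are present."""
    if filter_type == "positive":
        return PKT_DENY if (da_hit or sa_hit) else PKT_PERMIT
    if filter_type == "negative":
        return PKT_DENY if not (da_hit and sa_hit) else PKT_PERMIT
    return PKT_PERMIT  # default action, as in the pseudo code
```

This makes the asymmetry visible: positive filtering is an OR over table hits, negative filtering an AND.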
//********************* Format of port table entry *******************
// Long word 0: unused [31:24], EtherType [23:8], unused [7:6], filter type [5:4], port state [3:0]
// Long word 1: MAC Address
// Long word 2: MAC Address [15:0]
//**********************************************************************
ether_port_info(in_port_no)
{
// save port MAC address
xbuf_alloc($port_inf, 3)
ethport_tempa = in_port_no << BR_PORT_ENTRY_MULTIPLER
ethport_tempb = SRAM_PORT_STATE_BASE
copy $port_inf[0] <- SRAM(addr(ethport_tempb + ethport_tempa), size(3lwords))
out_port_entry = $port_inf [0]
extract out_port_mac_addr32 <- $port_inf (addr(BYTEOFFSET0 +4), size(4bytes))
extract out_port_mac_addr16 <- $port_inf (addr(BYTEOFFSET0 +10), size(2bytes))
xbuf_free($port_inf)
return
}
Figure A-10. Port information Acquisition for Filter
//************************* Format of IPv4 header ************************
// Long word 0: Version [0:3], HLen [4:7], TOS [8:15], Total Length [16:31]
// Long word 1: Ident [0:15], Flags [16:18], Offset [19:31]
// Long word 2: TTL [0:7], Protocol [8:15], Checksum [16:31]
// Long word 3: Source Address [0:31]
// Long word 4: Destination Address [0:31]
// Long word 5: Options (variable), Pad (variable)
//***********************************************************************
get_IP_header()
{
xbuf_alloc($pkt_buf_ip , 4 lwords)
xbuf_link($pkt_buf, $pkt_buf_ip )
xbuf_link($pkt_buf_ip , $pkt_buf)
copy $pkt_buf_ip[0] <- RFIFO(addr(rfifo_addr + QWOFFSET2), size(3quadwords))
return
}
Figure A-11. IP Header Acquisition
IP_version_check()
{
if (bit(pkstate, SHIFT_PKTLINKTYPE_ETHERNET) == TRUE)
extract ip_verslen <- $pkt_buf (addr(BYTEOFFSET14), size(1byte))
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLC) == TRUE)
extract ip_verslen <- $pkt_buf (addr(BYTEOFFSET17), size(1byte))
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)
extract ip_verslen <- $pkt_buf (addr(BYTEOFFSET22), size(1byte))
// Save SA 2-5 as the packet wraps and overwrites pkt_buf0-1
extract tempa <- $pkt_buf (addr(BYTEOFFSET0 + 8), size(4bytes)) // Save SA bytes 2-5 as the following rfifo_read overwrites
extract tempb <- $pkt_buf (addr(BYTEOFFSET0 + 12), size(4bytes)) // Save len/ssap/dsap as the following rfifo_read overwrites
extract tempc <- $pkt_buf (addr(BYTEOFFSET0 + 16), size(4bytes)) // Save CTL/OUI as the following rfifo_read overwrites
copy $pkt_buf[2] <- RFIFO(addr(rfifo_addr + QWOFFSET5), size(1quadword))
end if
return
}
Figure A-12. IP Version Check
xferpayload_&_iphdrchck_&_modify()
{
if (bit(pkstate, SHIFT_PKTLINKTYPE_ETHERNET) == TRUE)
copy RFIFO(addr(rfifo_addr + QWOFFSET4), size(4quadwords)) -> DRAM(addr(pkbuf_addr + QWOFFSET4))
exception = ip_verify($pkt_buf, BYTEOFFSET14)
ip_modify($$dxfer, BYTEOFFSET14, $pkt_buf, BYTEOFFSET14)
extract ip_dest <- $pkt_buf (addr(BYTEOFFSET14 + 16), size(4bytes))
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLC) == TRUE)
copy RFIFO(addr(rfifo_addr + QWOFFSET4), size(4quadwords)) -> DRAM(addr(pkbuf_addr + QWOFFSET4))
exception = ip_verify($pkt_buf, BYTEOFFSET17)
ip_modify($$dxfer, BYTEOFFSET17, $pkt_buf, BYTEOFFSET17)
$$dxfer[3] = $pkt_buf [3]
extract ip_dest <- $pkt_buf (addr(BYTEOFFSET17 + 16), size(4bytes))
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)
copy RFIFO(addr(rfifo_addr + QWOFFSET5), size(3quadwords)) -> DRAM(addr(pkbuf_addr + QWOFFSET5))
exception = ip_verify($pkt_buf, BYTEOFFSET22)
ip_modify($$dxfer, BYTEOFFSET22, $pkt_buf, BYTEOFFSET22)
extract ip_dest <- $pkt_buf (addr(BYTEOFFSET22 +16), size(4bytes))
end if
xbuf_free($pkt_buf_ip ) // release $pkt_buf_ip assigned by get_IP_header
return
}
Figure A-13. IP Header Check & Modify
ip_verify($pkt_buf, BYTEOFFSET)
{
total_len_verify:
extract total_len <- $pkt_buf (addr(BYTEOFFSET+2), size(2bytes))
if ((total_len - 0x14) >= 0) // at least 20 bytes (= IP header length)
goto ttl_verify
else
exception = IP_BAD_TOTAL_LENGTH
goto end
end if
ttl_verify:
extract ttl <- $pkt_buf (addr(BYTEOFFSET+8), size(1byte))
if (ttl > 0) // TTL must be at least 1
exception = 0
goto cksum_verify
else
exception = IP_BAD_TTL
goto end
end if
cksum_verify:
exception = ip_cksum_verify($pkt_buf, addr(BYTEOFFSET+10), size(2bytes))
if (exception == 0)
goto end
else
exception = IP_BAD_CHECKSUM
end if
end:
return(exception)
}
Figure A-14. IP verify
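The ip_cksum_verify step invoked above is the standard IPv4 header checksum check. As a self-contained stand-in (this is the RFC 1071 algorithm, not the IXP1200 microcode):

```python
def ip_checksum_ok(header: bytes) -> bool:
    """Verify an IPv4 header checksum: the 16-bit ones'-complement sum over
    the whole header, checksum field included, must come out as 0xFFFF."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]   # big-endian 16-bit words
    while total >> 16:                              # fold end-around carries
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF
```

Because the transmitted checksum is the complement of the sum of the other words, including it makes a valid header sum to all ones.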
ip_modify($$dxfer, IPHDR_WR_BYTEOFFSET, $pkt_buf, IPHDR_RD_BYTEOFFSET)
{
xbuf_xfer_set($pkt_buf, IPHDR_RD_START_BYTE) // define as $pkt_buf [0:7]
xbuf_xfer_set($$dxfer, IPHDR_WR_START_BYTE) // define as $$dxfer [0:7]
// alignment check
RD_align = read_align_check (IPHDR_RD_BYTEOFFSET & 0x3)
WR_align = write_align_check (IPHDR_WR_BYTEOFFSET & 0x3)
#if (RD_align != WR_align)
display assembler error
#else
#if (RD_align == 0)
$$dxfer[0] = $pkt_buf[0]
$$dxfer[1] = $pkt_buf[1]
temp = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET0, size(1byte)) // ttl = ttl -1
$$dxfer[2] = ip_cksum_modify(temp, BYTEOFFSET2, size(2byte))
$$dxfer[3] = $pkt_buf[3]
$$dxfer[4] = $pkt_buf[4]
#elif (RD_align == 1)
$$dxfer[0] = $pkt_buf[0]
$$dxfer[1] = $pkt_buf[1]
temp = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET1, size(1byte)) // ttl = ttl - 1
$$dxfer[2:3] = ip_cksum_B3align_modify(temp, $pkt_buf[3], size(2byte)) // because of ttl decr
$$dxfer[4] = $pkt_buf[4]
#elif (RD_align == 2)
$$dxfer[0] = $pkt_buf[0]
$$dxfer[1] = $pkt_buf[1]
$$dxfer[2] = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET2, size(1byte)) // ttl = ttl -1
$$dxfer[3] = ip_cksum_modify($pkt_buf[3], BYTEOFFSET0, size(2byte)) // because of ttl decr
$$dxfer[4] = $pkt_buf[4]
#elif (RD_align == 3)
$$dxfer[0] = $pkt_buf[0]
$$dxfer[1] = $pkt_buf[1]
$$dxfer[2] = ip_ttl_decrement($pkt_buf[2], BYTEOFFSET3, size(1byte)) // ttl = ttl - 1
$$dxfer[3] = ip_cksum_modify($pkt_buf[3], BYTEOFFSET2, size(2byte)) // because of ttl decr
$$dxfer[4] = $pkt_buf[4]
#endif
#endif
return
}
Figure A-15. IP Modify
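The ip_cksum_modify call paired with each TTL decrement exploits incremental checksum update: rather than resumming the header, the checksum is patched to compensate for the one word that changed. A Python sketch of that trick (RFC 1141/1624 style; the function name is mine):

```python
def decrement_ttl(ttl: int, checksum: int):
    """TTL sits in the high byte of its 16-bit header word, so decrementing
    it changes that word by -0x0100; adding 0x0100 into the ones'-complement
    checksum (with end-around carry) keeps the header checksum valid."""
    new_sum = checksum + 0x0100                     # compensate the -0x0100
    new_sum = (new_sum & 0xFFFF) + (new_sum >> 16)  # fold the carry
    return ttl - 1, new_sum
```

On the fast path this saves rereading and resumming the ten header words for every forwarded packet.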
pk_late_discard(rec_req, exception)
{
// description: Increment exception counter, total discards, set discard flag
tempa = EXCEPTION_COUNTERS
tempa = ((tempa + (rec_req << 4)), bit_enable(LS8bit))
increment 1 Scratchpad(addr(tempa + exception))
tempa = TOTAL_DISCARDS
increment 1 Scratchpad(addr(tempa))
rec_state = rec_state | (1 << REC_STATE_DISCARD_BIT) // set discard flag
return(rec_state)
}
Figure A-16. Packet Discard
ip_trie5_lookup(ip_dest, SRAM_ROUTE_LOOKUP_BASE)
{
tables_base = SRAM_ROUTE_LOOKUP_BASE
temp_base2 = tables_base + (1 << 16) //add 0x10000, 256 entry table
temp_base3 = temp_base2 + (1 << 8) // add 0x100, multiple 16 entry tables
offset = ip_dest >> 16 // form offset from 31:16
first_lookup:
copy $rd_xfer0 <- SRAM(addr(tables_base + offset), size(1 lword)) // direct lookup off addr 31:16
offset = ip_dest >> 24 // form offset from 31:24
copy $rd_xfer1 <- SRAM(addr(temp_base2 + offset), size(1 lword)) // direct lookup off addr 31:24
prev_rt_long = 0
lookup_short = 0 + ($rd_xfer1 , byte_enable(0011))
if (lookup_short == 0)
goto long_path_only
else
lookup_long = 0 + ($rd_xfer0 , byte_enable(0011))
if (lookup_long == 0)
goto short_path_only
else
goto both_paths
end if
end if
short_path_only:
second_lookup_short:
next_trie(ip_dest, 20, prev_rt_short, lookup_short, $rd_xfer1 , temp_base3)
if (lookup_short == 0)
goto set_route_ptr
end if
third_lookup_short:
next_trie(ip_dest, 16, prev_rt_short, lookup_short, $rd_xfer1 , temp_base3)
if (lookup_short == 0)
goto set_route_ptr
else
goto set_route_ptr
end if
long_path_only:
lookup_long = 0 + ($rd_xfer0, byte_enable(0011))
if (lookup_long == 0)
goto set_route_ptr
end if
second_lookup_long:
next_trie(ip_dest, 12, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)
if (lookup_long == 0)
goto set_route_ptr
end if
third_lookup_long:
next_trie(ip_dest, 8, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)
if (lookup_long == 0)
goto set_route_ptr
end if
fourth_lookup_long:
next_trie(ip_dest, 4, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)
if (lookup_long == 0)
goto set_route_ptr
end if
fifth_lookup_long:
next_trie(ip_dest, 0, prev_rt_long, lookup_long, $rd_xfer0 , temp_base3)
if (lookup_long == 0)
goto set_route_ptr
end if
both_paths:
lookup_short = ((lookup_short + (ip_dest >> 20)), bit_enable(LS4bit))
copy $rd_xfer1 <- SRAM(addr(temp_base3 + lookup_short), size(1 lword))
prev_rt_short = 0 + ($rd_xfer1 , byte_enable(1100))
lookup_long = (lookup_long + (ip_dest >> 12), bit_enable(LS4bit))
copy $rd_xfer0 <- SRAM(addr(temp_base3 + lookup_long), size(1 lword))
prev_rt_long = 0 + ($rd_xfer0 , byte_enable(1100))
lookup_long = 0 + ($rd_xfer0 , byte_enable(0011))
if (lookup_long == 0)
goto second_both_no_long
end if
second_both_long:
lookup_short = 0 + ($rd_xfer1 , byte_enable(0011))
if (lookup_short == 0)
goto third_lookup_long
else
goto third_lookup_both
end if
second_both_no_long:
lookup_short = 0 + ($rd_xfer1 , byte_enable(0011))
if (lookup_short == 0)
goto set_route_ptr
else
goto third_lookup_short
end if
third_lookup_both:
lookup_short = ((lookup_short + (ip_dest >> 16)), bit_enable(LS4bit))
copy $rd_xfer1 <- SRAM(addr(temp_base3 + lookup_short), size(1 lword))
prev_rt_short = 0 + ($rd_xfer1 , byte_enable(1100))
lookup_long = ((lookup_long + (ip_dest >> 8)), bit_enable(LS4bit))
copy $rd_xfer0 <- SRAM(addr(temp_base3 + lookup_long), size(1 lword))
prev_rt_long = 0 + ($rd_xfer0 , byte_enable(1100))
lookup_long = 0 + ($rd_xfer0 , byte_enable(0011))
if (lookup_long == 0)
goto set_route_ptr
else
goto fourth_lookup_long
end if
set_route_ptr:
rt_ptr = $rd_xfer0 >> 17 // long match
if (rt_ptr != 0)
goto end
end if
rt_ptr = prev_rt_long >> 17 // long match at previous trie
if (rt_ptr != 0)
goto end
end if
rt_ptr = $rd_xfer1 >> 17 // short match
if (rt_ptr != 0)
goto end
end if
rt_ptr = prev_rt_short >> 17 // short match at previous trie
end:
return(rt_ptr)
}
Figure A-17. Trie Lookup

next_trie(ipaddr, SHIFT_AMT, prevout_rt_ptr, lookup, $xfer, trie_base)
{
lookup = 0 + ((lookup + (ipaddr >> SHIFT_AMT)), bit_enable(LS4bit))
copy $xfer <- SRAM(addr(trie_base + lookup), size(1 lword))
prevout_rt_ptr = 0 + ($xfer, byte_enable(1100))
lookup = 0 + ($xfer, byte_enable(0011))
return(lookup)
}
Figure A-18. Next_Trie_Search for Trie Lookup
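The trie walk of Figures A-17 and A-18 can be sketched in portable C. This is a simplified software analogue of the 4-bit-stride trie, not the microcode itself: each 32-bit trie entry packs a next-trie index in its low halfword and a route pointer in its high halfword, mirroring the byte_enable(0011)/byte_enable(1100) split above, and the walk remembers the last non-zero route pointer as the longest match so far. The table layout and all names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Each trie entry packs a next-trie index in bits 15:0 and a route
 * pointer in bits 31:16, mirroring the byte_enable(0011)/(1100) split. */
#define NEXT_IDX(e)  ((e) & 0xffffu)
#define RT_PTR(e)    ((e) >> 16)

/* Walk a 4-bit-stride trie over the low 16 bits of ip, remembering the
 * last non-zero route pointer (longest match so far).  trie[] is a flat
 * array of 16-entry nodes; node 0 is the root. */
static uint32_t trie_lookup(const uint32_t *trie, uint32_t ip)
{
    uint32_t rt = 0, node = 0;
    for (int shift = 12; shift >= 0; shift -= 4) {
        uint32_t e = trie[node * 16 + ((ip >> shift) & 0xf)];
        if (RT_PTR(e))
            rt = RT_PTR(e);          /* longer match found */
        node = NEXT_IDX(e);
        if (node == 0)
            break;                   /* no deeper trie: done */
    }
    return rt;
}
```

As in the microcode, a zero next-trie index terminates the walk and the most specific route pointer seen so far is returned.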
write_modified_IP_Ether_header()
{
if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)
copy $$dxfer[0] -> DRAM(addr(pkbuf_addr + QWOFFSET4), size(1 quadword))
end if
// $$dxfer0 = output port, $$dxfer1 = MAC DA bytes 0-3, $$dxfer2 = MAC DA bytes 4-5
output_intf = $$dxfer[0] << 3 // save for enqueue
$$dxfer[0] = $$dxfer[1] // new DA bytes 0-3
#ifdef LITTLE_ENDIAN
$$dxfer[1] = $$dxfer[2] + (sa01 << 16)// merge new DA 4-5 with SA 0-1
#else
$$dxfer[1] = sa01 + ($$dxfer[2] << 16)// merge new DA 4-5 with SA 0-1
#endif
if ((bit(pkstate, SHIFT_PKTLINKTYPE_ETHERNET) == TRUE) || (bit(pkstate, SHIFT_PKTLINKTYPE_LLC) == TRUE))
$$dxfer[2] = $$pkt_buf[2] // previous SA bytes 2-5
else if (bit(pkstate, SHIFT_PKTLINKTYPE_LLCSNAP) == TRUE)
$$dxfer[2] = $$tempa // previous SA bytes 2-5
$$dxfer[3] = $$tempb // len/ssap/dsap
$$dxfer[4] = $$tempc // CTL/OUI
end if
copy $$dxfer[0:7] -> DRAM(addr(pkbuf_addr + QWOFFSET0), size(4 quadwords)) // write modified packet
return
}
Figure A-19. Write Modified IP and Ether Header
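The endian-dependent merge in Figure A-19 exists because the 6-byte destination MAC spans two 32-bit transfer words, so its last two bytes must share a longword with the first two bytes of the source MAC. A C sketch of just that merge, assuming both inputs carry their two bytes in the low halfword; the function name is mine:

```c
#include <assert.h>
#include <stdint.h>

/* Merge the last 2 bytes of the new destination MAC (da45, low halfword)
 * with the first 2 bytes of the source MAC (sa01, low halfword) into one
 * 32-bit word, as in Figure A-19.  The endian flag decides which field
 * occupies which half of the longword. */
static uint32_t merge_da_sa(uint32_t da45, uint32_t sa01, int little_endian)
{
    if (little_endian)
        return (da45 & 0xffffu) | (sa01 << 16);   /* DA in 15:0, SA in 31:16 */
    return (sa01 & 0xffffu) | (da45 << 16);       /* DA in 31:16, SA in 15:0 */
}
```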
tx_assignment_read(@assign#)
{
wait_for_assignment:
if (@assign# < 0)
goto wait_for_assignment
else
port = (@assign#, bit_enable(LS4bit))
skip_flag = @assign# & skip_bit_on // skip_bit_on is set in initialization
tfifo_entry = (port , bit_enable(LS4bit))
q_offset = port << 4
end if
return
}
Figure A-20. Transmit Assignment Read
//****************** Format of Queue Descriptor *********************
// q_desc0: head_ptr [31:16], tail_ptr [15:0]
// q_desc1: packet count [31:0]
//******************* Format of Packet Link List ********************
// pkt_link0: next packet link [31:0]
// pkt_link1: RCV_port [31:27], freelist [26:24], pkt_start_byte [23:16],
//            pkt_end_byte [15:8], ele_count [7:0]
//*******************************************************************
tx_pktlinklist_read(q_desc_base, q_offset, buf_desc_base)
{
copy $q_desc0 <- SRAM(addr(q_desc_base + q_offset), size(2 lwords)) with lock
// lock and read the queue descriptor (2 longwords); unlocked by tx_packetlinklist_update
tmp_head_ptr = q_desc0 >> 16 // isolate head ptr
buf_offset = ~0x7 & q_desc0 >> 13 // isolate next packet link and
//mult by 8 to get relative address
copy $pkt_link0 <- SRAM(addr(buf_desc_base + tmp_head_ptr), size(2 lwords))
//read packet_link 2 longwords get next head, status
tail_ptr = 0 + (q_desc0, byte_enable(0011))
ele_remaining = 0 + ($pkt_link1 >> DESC1_ELE_COUNT0, byte_enable(0001))
last_mpkt_byte_cnt = 0x3f & ($pkt_link1 >> DESC1_PKT_END_BYTE8)
bank = bit20on & ($pkt_link1 >> 4)
}
Figure A-21. Transmit Packet Link List Read
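The descriptor unpacking in Figure A-21 is plain shift-and-mask work. A C sketch of the three extractions, using the queue-descriptor layout shown above (head pointer in bits 31:16, tail pointer in bits 15:0); the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Unpack the first queue-descriptor longword of Figure A-21:
 * head pointer in bits 31:16, tail pointer in bits 15:0. */
static uint32_t head_ptr(uint32_t q_desc0) { return q_desc0 >> 16; }
static uint32_t tail_ptr(uint32_t q_desc0) { return q_desc0 & 0xffffu; }

/* (q_desc0 >> 13) & ~7 is exactly head_ptr << 3: shifting by 13 and
 * clearing the low 3 bits multiplies the 16-bit head pointer by 8,
 * yielding the relative address used for the packet-link read. */
static uint32_t buf_offset(uint32_t q_desc0) { return (q_desc0 >> 13) & ~0x7u; }
```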
tx_packetlinklist_update($q_desc0, $q_desc1, q_desc_base, tail_ptr, q_offset, $pkt_link0, port)
{
q_pkt_count = $q_desc1 - 1 // decrement the packet count
if (q_pkt_count > 0)
goto packets_remaining
else
tx_portvector_clear(port, pwp_addr)
end if
packets_remaining:
tail_ptr = (tail_ptr, byte_enable(0011)) + ($pkt_link0 << 16, byte_enable(1100))
$q_desc0 = tail_ptr
$q_desc1 = 0 + (q_pkt_count, byte_enable(0011))
copy $q_desc0 -> SRAM(addr(q_desc_base + q_offset), size(2 lwords)) and unlock // lock acquired in tx_pktlinklist_read
}
Figure A-22. Transmit Packet Link List Update
tx_portvector_clear(port, pwp_addr)
{
tpc_temp = (1 << 5) - port // indirect shift left 32 - portnum
$xfer_reg = 1 << tpc_temp
clear bit Scratch(addr(pwp_addr), bit position($xfer_reg)) //clear bit for this port
}
Figure A-23. Transmit Port Vector clear
tx_last_mpkt_xfr(bank, buf_offset, last_mpkt_byte_cnt, tfifo_entry, pkt_buffer_base)
{
qw_offset = bank + (buf_offset << 3)
indirect = 0x7 & (last_mpkt_byte_cnt >> 3) //divide by 8 for conversion to quadwords
indirect = bit20_15on | indirect << 16 //place quadword count in 19:16
copy SDRAM(addr(pkt_buffer_base + qw_offset), size(8 qwords)) -> tfifo(indirect | tfifo_entry << 7)
}
Figure A-24. Last Packet Transfer
//****************** Format of TFIFO Control Field *********************
// RES [31:19], Tx Err [18], Tx asis [17], Pre pnd [16], # qwds [15:13],
// Valid bytes [12:10], EOP [9], SOP [8], Skip [7], mac [6:4], Port [3:0]
//***********************************************************************
tx_status_set(last_mpkt_byte_cnt, BITS_TO_SET, port)
{
temp = BITS_TO_SET | last_mpkt_byte_cnt << 2 // ex) 16 elements count OR EOP_AND_SOP = 3
$tfifo_ctl_wd0 = port | temp << 8
}
Figure A-25. Set Transmit Control Word

tfifo_validate(tfifo_entry, $tfifo_ctl_wd0)
{
tfifo_status_write(tfifo_entry, $tfifo_ctl_wd0)
xmit_ptr_wait:
copy $xmit_ptr <- CSR_XMIT_PTR
copy $tx_rdy_copy <- CSR_XMIT_RDY_LO
temp_reg = $xmit_ptr - tfifo_entry
if (temp_reg == 0)
goto port_wait_loop
end if
if (temp_reg > 0)
goto ptr_wrapped // xmit ptr > t_fifo_entry -> wrap condition
end if
temp_reg = temp_reg + 5
if (temp_reg >= 0)
goto port_wait_loop
else
goto xmit_ptr_wait // the xmit_ptr is not close enough yet
end if
ptr_wrapped:
temp_reg = temp_reg - 11
if (temp_reg < 0)
goto xmit_ptr_wait // the xmit_ptr is not close enough yet
end if
port_wait_loop:
if ((1 & $tx_rdy_copy >> tfifo_entry) > 0)
return_status = PASS
goto write_validate
else
$tfifo_ctl_wd0 = tfifo_entry | 1 << 7 // set skip bit
tfifo_status_write(tfifo_entry, $tfifo_ctl_wd0)
return_status = FAIL
end if
write_validate:
tfifo_validate_write(tfifo_entry, in_bit15on)
return
}
Figure A-26. TFIFO Validate
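The +5 and -11 adjustments in tfifo_validate() implement a wrap-aware distance test on the 16-element TFIFO: the thread proceeds once its element is at most 5 slots ahead of the hardware transmit pointer, modulo 16. A C sketch of the equivalent check (the function name is mine):

```c
#include <assert.h>

/* tfifo_validate() waits until the hardware transmit pointer is "close
 * enough" to this thread's TFIFO element.  With 16 elements, the +5/-11
 * arithmetic above reduces to a wrap-aware distance test: proceed once
 * the element is at most 5 slots ahead of xmit_ptr (mod 16). */
static int xmit_ptr_close(unsigned xmit_ptr, unsigned tfifo_entry)
{
    unsigned dist = (tfifo_entry - xmit_ptr) & 0xfu;  /* distance mod 16 */
    return dist <= 5;
}
```

For example, with xmit_ptr at 14 and tfifo_entry at 2, the microcode takes the ptr_wrapped path (14 - 2 - 11 = 1 >= 0) and proceeds, matching the mod-16 distance of 4 here.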
tx_portvect_modify(@local_pwp , port, IN_VALUE)
{
hold_it = (1 << 5) - port
hold_it = 1 << hold_it
#if (IN_VALUE == 1) // if set bit
@local_pwp = @local_pwp | hold_it
#else // IN_VALUE == 0 (clear bit)
@local_pwp = @local_pwp & ~(hold_it)
#endif
}
Figure A-27. Transmit Port Vector Modify
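Both port-vector routines map port n to bit position 32 - n with an indirect shift. A C sketch, assuming (as the microcode appears to) that the shift count is taken mod 32 so that port 0 lands on bit 0; a plain `1u << 32` would be undefined behaviour in C, hence the explicit mask. The helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* The port-vector routines map port n to bit (32 - n) of a 32-bit word
 * via an indirect shift.  Assuming hardware shifts are taken mod 32,
 * port 0 lands on bit 0; the "& 31" reproduces that in C. */
static uint32_t port_bit(unsigned port)
{
    return 1u << ((32u - port) & 31u);
}

static uint32_t portvect_set(uint32_t v, unsigned port)   { return v | port_bit(port); }
static uint32_t portvect_clear(uint32_t v, unsigned port) { return v & ~port_bit(port); }
```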
tx_mpkt_xfr(bank, buf_offset, tfifo_entry, pkt_buffer_base, 8)
{
qw_offset = bank + (buf_offset << 3)
indirect = bit20_15on | 7 << 16 // place quadword count 7 in 19:16
indirect = indirect | tfifo_entry << 7 // put element no. in 10:7
copy SDRAM(addr(pkt_buffer_base + qw_offset), size(8 qwords)) -> t_fifo(indirect)
}
Figure A-28. Transmit Packet Transfer
Appendix B: Microengine Instruction Set
Table B-1. Microengine Instruction Set
Instruction Description
Arithmetic, Rotate, and Shift
Instructions
alu
Perform an ALU operation on one or two operands and deposit the result into
the destination register. Update all ALU condition codes according to the
result of the operation. Condition codes are lost during context swaps. The
sign condition code is not valid on underflow or overflow conditions.
alu_shf
Perform an ALU operation on one or two operands and deposit the result into
the destination register. The B operand is shifted or rotated prior to the ALU
operation. Update all ALU condition codes according to the result of the
operation. Condition codes are lost during context swaps. The sign condition
code is not valid on underflow or overflow conditions.
dbl_shf
Load a destination register with a 32-bit longword that is formed by
concatenating the A operands and B operands together, right shifting the
64-bit quantity by the specified amount, and then storing the lower 32 bits into the destination register.
Branch and Jump Instructions
br Branch unconditionally
br=0, br!=0, br>0, br>=0, br<0,
br<=0, br=cout, br!=cout
Branch to an instruction at a specified label based on an ALU condition code.
The ALU condition codes are Sign, Zero, and Carryout (cout). The sign
condition code is not valid on underflow or overflow conditions.
br_bset, br_bclr Branch to the instruction at the specified label when the specified bit of the
register is set or clear, respectively. These instructions set the condition codes.
br=byte, br!=byte
Branch to the instruction at the specified label if a specified byte in a longword
matches or mismatches the byte_compare_value. The br=byte instruction
prefetches the instruction for the “branch taken” condition rather than the
next sequential instruction. The br!=byte instruction prefetches the next
sequential instruction. These instructions set the condition codes.
br=ctx, br!=ctx Branch to the instruction at the specified label based on whether or not the
current context is the specified context number.
br_inp_state
Branch if the state of the specified state name is set to 1. A state is set to 1 or 0
by a functional unit in the IXP1200 and indicates the currently processing
state. It is available to all microengines.
br_!signal Branch if the specified signal is deasserted. If the signal is asserted, clear the
signal and do not branch.
jump Unconditional branch to an address that is formed during runtime execution
by the addition of the register and label# values.
rtn
Unconditional branch to the address contained in the lower 10 bits of the
specified register (addresses 0 through 1023). Typically used to return from a
branch or jump instruction.
Reference Instructions
csr
Issue a read or write operation to the specified control/status register (CSR).
Transfers exactly one 32-bit register value to or from the specified SRAM
transfer register.
fast_wr
Write the specified immediate data to the specified FBI CSR. A fast write
operation has the write data specified directly in the instruction rather than in
a transfer register. This improves performance by eliminating the need for the
FBI Unit to pull the data from a transfer register. The FBI Unit automatically
shifts the immediate data into the appropriate register field corresponding to
the thread that is writing the FAST_WR data.
local_csr_rd
Read the specified 16-bit microengine CSR register. The 16-bit read data is
accessed by replacing the immediate data source operand of the next
instruction with the microengine CSR read data. If the very next instruction
does not contain an immediate data source operand field, then the opportunity
to access the CSR data read from the previous instruction is lost. A
local_csr_rd or local_csr_wr instruction must not immediately follow or
precede a local_csr_wr instruction.
local_csr_wr
Write specified microengine CSR register with the lower 16 bits of the
specified source register. Unlike normal GPR registers, no built in bypasses
exist in the datapath when reading microengine CSRs immediately after
writing them. Therefore, to compensate for microengine CSR read/write
latency, a local_csr_rd to a given CSR must be at least the third opcode
following a local_csr_wr to the same CSR in order for CSR read data to reflect
CSR write data. A local_csr_wr instruction must not be placed in the last
deferred window of an instruction. Also, a local_csr_rd or local_csr_wr
instruction must not immediately follow or precede a local_csr_wr instruction.
r_fifo_rd Issue a read reference from the receive FIFO data and status elements to a
transfer register
pci_dma Used to issue DMA requests to the PCI Unit. Improved performance can be
achieved if DMA data is located on 64-byte boundaries.
scratch Issue a memory reference to scratchpad memory
sdram Issue a memory reference to SDRAM
sram Issue a memory reference to SRAM, Flash, or Slow Port
t_fifo_wr Issue a write reference from a transfer register data and control/prepend
elements directly to the transmit FIFO
Local Register Instructions
find_bset, find_bset_with_mask
Returns the bit position number of the first set bit in a 16-bit field of a
microengine register. Provides an optional shift control token that enables any
arbitrary 16-bit field to be evaluated. The result of the operation is deposited
into one of two result registers that are not visible to the microengines. The
microengines must explicitly move the contents of the result registers into one
of the microengine GPR or transfer registers via the load_bset_result1 and
load_bset_result2 instructions.
immed
Load 16 bits of immediate data into the specified register. The immediate data
must be specified with the upper 16 bits equal to either all zeros or all ones.
The immediate data can be placed within the longword on any 8-bit boundary,
based on the optional shift parameter. The fill data is either all zeros or all
ones and is based on the specified upper 16 bits.
immed_b0, immed_b1,
immed_b2, immed_b3
If a GPR is specified as the dest_reg, one byte of immediate data is loaded into
the specified byte of the destination while preserving all the other bits of the
destination. These instructions perform a read-modify-write operation on a
specified destination register. If a Transfer register is specified as the dest_reg,
these instructions perform a read and modify from a read transfer register and
write the result into a write transfer register.
immed_w0, immed_w1
If a GPR is specified as the dest_reg, one word of immediate data is loaded into
the specified word of the destination while preserving all the other bits of the
destination. These instructions perform a read-modify-write operation on a
specified destination register.
ld_field, ld_field_w_clr
Load 1 or more bytes within a register with the shifted value of another
operand. Data in the bytes that are not loaded remain unchanged or are
cleared. Ld_field performs a read-modify-write on a destination register.
Ld_field_w_clr performs a write to a destination register. When a transfer
register is used as the destination register, ld_field reads from the read
transfer register and writes the modified data to the write transfer register.
load_addr Load a register with an address of the location specified by label#
load_bset_result1,
load_bset_result2
Load the specified register with the result of a find_bset or
find_bset_with_mask instruction. These instructions set the condition codes. If
the result is 0, then the result register data is invalid and the find_bset
instruction did not detect a set bit. Due to latency issues in the hardware, a
minimum of three microengine cycles (equivalent to three instructions) must
occur between the final find_bset instruction and the load_bset_result in order
for the result registers to reflect the result of the final find_bset instruction.
After a find_bset or find_bset_with_mask instruction is deposited into a result
register, the result register is validated and locked until it is explicitly cleared
by the user. If the first result register is locked, the second result register will
be loaded and locked when the next set bit is detected. If both result registers
are locked then the result is not reported. The result registers are explicitly
unlocked (or cleared) using the clr_results optional token.
Miscellaneous Instructions
ctx_arb Swap the currently running context out to let another context execute. Wake
up the swapped out context when the specified signal is activated.
nop Consume one microcycle without performing any operation and without
setting any microengine state
hash1_48, hash2_48, hash3_48 Executes one, two, or three 48-bit hash operations
hash1_64, hash2_64, hash3_64 Executes one, two, or three 64-bit hash operations
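The find_bset semantics described in the table can be modeled in a few lines of C. This is an illustrative software model only: it takes "first set bit" to mean the least-significant one and returns -1 for "no bit found" instead of using the hardware's result-register validity flag, and the result-register locking protocol is omitted entirely.

```c
#include <assert.h>

/* Software model of find_bset over a 16-bit field: return the position
 * (0..15) of the least-significant set bit, or -1 when no bit is set.
 * (The hardware instead flags "no bit found" via an invalid result
 * register; the locking protocol described above is omitted here.) */
static int find_bset16(unsigned field)
{
    field &= 0xffffu;
    for (int pos = 0; pos < 16; pos++)
        if (field & (1u << pos))
            return pos;
    return -1;
}
```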
Appendix C: Instruction Mix Data
Table C-1. Instruction Mix Data for 64-byte packets

Instruction: uEngine0 uEngine1 uEngine2 uEngine3 uEngine4 uEngine5 | Rx(0,1,2,3) Tx(4,5) Overall

Arithmetic, Rotate, and Shift Instructions
alu (Perform an alu operation): 40838 40703 40817 41015 37382 37645 | 163373 75027 238400
alu_shf (Perform an alu and shift operation): 55298 56617 56418 55305 85494 84574 | 223638 170068 393706
dbl_shf (Concatenate two longwords, shift the result, and save a longword): 0 0 0 0 0 0 | 0 0 0
Sub Total: 96136 97320 97235 96320 122876 122219 | 387011 245095 632106
Percentage: Rx 40.8%, Tx 48.2%, Overall 43.4%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout (Branch on condition code): 61389 60253 59679 60116 63440 69028 | 241437 132468 373905
br_bset, br_bclr (Branch on bit set or bit clear): 5536 5835 5833 5827 0 0 | 23031 0 23031
br=byte, br!=byte (Branch on byte equal): 0 0 0 0 0 0 | 0 0 0
br=ctx, br!=ctx (Branch on current context): 17 13 13 13 10196 10192 | 56 20388 20444
br_inp_state (Branch on event state, e.g. sram done): 0 0 0 0 0 0 | 0 0 0
br_!signal (Branch if signal deasserted): 339 339 340 340 0 0 | 1358 0 1358
jump (Jump to label): 0 0 0 0 1643 1572 | 0 3215 3215
rtn (Return from a branch or a jump): 0 0 0 0 0 0 | 0 0 0
Sub Total: 67281 66440 65865 66296 75279 80792 | 265882 156071 421953
Percentage: Rx 28.0%, Tx 30.7%, Overall 29.0%

Reference Instructions
csr (Csr reference): 4768 4192 4177 4239 4678 4699 | 17376 9377 26753
fast_wr (Write immediate data to thd_done csrs): 0 0 0 0 6108 6109 | 0 12217 12217
local_csr_rd, local_csr_wr (Read and write csrs): 0 0 0 0 0 0 | 0 0 0
r_fifo_rd (Read the receive fifo): 2142 2338 2338 2336 0 0 | 9154 0 9154
pci_dma (Issue a request to the pci unit): 0 0 0 0 0 0 | 0 0 0
scratch (Scratchpad reference): 871 772 742 500 2405 2319 | 2885 4724 7609
sdram (Sdram reference): 2430 2724 2723 2721 3059 3057 | 10598 6116 16714
sram (Sram reference): 7295 8173 8201 8442 7750 7361 | 32111 15111 47222
t_fifo_wr (Write to the transmit fifo): 0 0 0 0 3087 3094 | 0 6181 6181
Sub Total: 17506 18199 18181 18238 27087 26639 | 72124 53726 125850
Percentage: Rx 7.6%, Tx 10.6%, Overall 8.6%

Local Register Instructions
find_bset, find_bset_with_mask (Position of first set bit in a 16-bit field): 0 0 0 0 0 0 | 0 0 0
immed (Load immediate word, sign extend or zero fill with shift): 23458 22802 22734 22613 55 65 | 91607 120 91727
immed_b0, immed_b1, immed_b2, immed_b3 (Load immediate byte to a field): 0 0 0 0 0 0 | 0 0 0
immed_w0, immed_w1 (Load immediate word to a field): 3 4 4 4 0 0 | 15 0 15
ld_field, ld_field_w_clr (Load byte(s) into specified field(s)): 15185 16744 16809 17285 6200 5808 | 66023 12008 78031
load_addr (Load instruction address): 0 0 0 0 0 0 | 0 0 0
load_bset_result1, load_bset_result2 (Load find_bset result): 0 0 0 0 0 0 | 0 0 0
Sub Total: 38646 39550 39547 39902 6255 5873 | 157645 12128 169773
Percentage: Rx 16.6%, Tx 2.4%, Overall 11.7%

Miscellaneous Instructions
ctx_arb (Context swap and wake on event): 16403 15382 15437 15162 16306 16407 | 62384 32713 95097
nop (No operation): 0 0 0 0 4743 4476 | 0 9219 9219
hash1_48, hash2_48, hash3_48 (48-bit hash): 779 779 779 779 0 0 | 3116 0 3116
hash1_64, hash2_64, hash3_64 (64-bit hash): 0 0 0 0 0 0 | 0 0 0
Sub Total: 17182 16161 16216 15941 21049 20883 | 65500 41932 107432
Percentage: Rx 6.9%, Tx 8.2%, Overall 7.4%

TOTAL: 236751 237670 237044 236697 252546 256406 | 948162 508952 1457114
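The Percentage rows in these tables are simply each category subtotal divided by the corresponding Rx, Tx, or overall instruction total, rounded to one decimal place. A small C check using the arithmetic/rotate/shift group of Table C-1 (the rounding helper is mine, not part of the thesis tooling):

```c
#include <assert.h>

/* Percentage of a category subtotal against a column total, rounded to
 * one decimal place as in the instruction-mix tables. */
static double pct(double subtotal, double total)
{
    return (double)((long)(subtotal / total * 1000.0 + 0.5)) / 10.0;
}
```

For example, pct(387011, 948162) reproduces the 40.8% Rx figure for the arithmetic, rotate, and shift group in Table C-1.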
Table C-2. Instruction Mix Data for 594-byte packets

Instruction: uEngine0 uEngine1 uEngine2 uEngine3 uEngine4 uEngine5 | Rx(0,1,2,3) Tx(4,5) Overall

Arithmetic, Rotate, and Shift Instructions
alu: 252209 251109 251407 251782 270134 270583 | 1006507 540717 1547224
alu_shf: 235238 238200 236792 235271 706257 693919 | 945501 1400176 2345677
dbl_shf: 0 0 0 0 0 0 | 0 0 0
Sub Total: 487447 489309 488199 487053 976391 964502 | 1952008 1940893 3892901
Percentage: Rx 32.5%, Tx 51.3%, Overall 39.8%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout: 550680 547273 547142 546994 468202 472093 | 2192089 940295 3132384
br_bset, br_bclr: 19576 19870 19870 19870 0 0 | 79186 0 79186
br=byte, br!=byte: 0 0 0 0 0 0 | 0 0 0
br=ctx, br!=ctx: 17 13 13 13 75932 75603 | 56 151535 151591
br_inp_state: 0 0 0 0 0 0 | 0 0 0
br_!signal: 339 339 340 340 0 0 | 1358 0 1358
jump: 0 0 0 0 35251 33333 | 0 68584 68584
rtn: 0 0 0 0 0 0 | 0 0 0
Sub Total: 570612 567495 567365 567217 579385 581029 | 2272689 1160414 3433103
Percentage: Rx 37.8%, Tx 30.7%, Overall 35.1%

Reference Instructions
csr: 63518 63387 62803 62725 42576 42764 | 252433 85340 337773
fast_wr: 0 0 0 0 45549 45159 | 0 90708 90708
local_csr_rd, local_csr_wr: 0 0 0 0 0 0 | 0 0 0
r_fifo_rd: 2144 2340 2340 2340 0 0 | 9164 0 9164
pci_dma: 0 0 0 0 0 0 | 0 0 0
scratch: 875 776 728 455 8962 8860 | 2834 17822 20656
sdram: 9448 9742 9742 9742 22779 22583 | 38674 45362 84036
sram: 7297 8174 8222 8495 7758 7364 | 32188 15122 47310
t_fifo_wr: 0 0 0 0 26990 26856 | 0 53846 53846
Sub Total: 83282 84419 83835 83757 154614 153586 | 335293 308200 643493
Percentage: Rx 5.6%, Tx 8.2%, Overall 6.6%

Local Register Instructions
find_bset, find_bset_with_mask: 0 0 0 0 0 0 | 0 0 0
immed: 126959 127183 125967 125538 17457 16647 | 505647 34104 539751
immed_b0, immed_b1, immed_b2, immed_b3: 0 0 0 0 0 0 | 0 0 0
immed_w0, immed_w1: 3 4 4 4 0 0 | 15 0 15
ld_field, ld_field_w_clr: 22208 23774 23870 24416 6208 5812 | 94268 12020 106288
load_addr: 0 0 0 0 0 0 | 0 0 0
load_bset_result1, load_bset_result2: 0 0 0 0 0 0 | 0 0 0
Sub Total: 149170 150961 149841 149958 23665 22459 | 599930 46124 646054
Percentage: Rx 10.0%, Tx 1.2%, Overall 6.6%

Miscellaneous Instructions
ctx_arb: 213720 211737 211940 211769 104106 104924 | 849166 209030 1058196
nop: 0 0 0 0 59199 56418 | 0 115617 115617
hash1_48, hash2_48, hash3_48: 780 780 780 780 0 0 | 3120 0 3120
hash1_64, hash2_64, hash3_64: 0 0 0 0 0 0 | 0 0 0
Sub Total: 214500 212517 212720 212549 163305 161342 | 852286 324647 1176933
Percentage: Rx 14.2%, Tx 8.6%, Overall 12.0%

TOTAL: 1505011 1504701 1501960 1500534 1897360 1882918 | 6012206 3780278 9792484
Table C-3. Instruction Mix Data for 1518-byte packets

Instruction: uEngine0 uEngine1 uEngine2 uEngine3 uEngine4 uEngine5 | Rx(0,1,2,3) Tx(4,5) Overall

Arithmetic, Rotate, and Shift Instructions
alu: 650125 648994 649386 650209 720363 714363 | 2598714 1434726 4033440
alu_shf: 442283 454063 453296 454855 1661437 1676711 | 1804497 3338148 5142645
dbl_shf: 0 0 0 0 0 0 | 0 0 0
Sub Total: 1092408 1103057 1102682 1105064 2381800 2391074 | 4403211 4772874 9176085
Percentage: Rx 30.3%, Tx 50.9%, Overall 38.4%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout: 1437275 1449626 1450082 1450071 1189315 1175727 | 5787054 2365042 8152096
br_bset, br_bclr: 46780 33261 33303 33710 0 0 | 147054 0 147054
br=byte, br!=byte: 0 0 0 0 0 0 | 0 0 0
br=ctx, br!=ctx: 17 13 13 13 208426 207095 | 56 415521 415577
br_inp_state: 0 0 0 0 0 0 | 0 0 0
br_!signal: 339 339 340 340 0 0 | 1358 0 1358
jump: 0 0 0 0 65712 68952 | 0 134664 134664
rtn: 0 0 0 0 0 0 | 0 0 0
Sub Total: 1484411 1483239 1483738 1484134 1463453 1451774 | 5935522 2915227 8850749
Percentage: Rx 40.8%, Tx 31.1%, Overall 37.0%

Reference Instructions
csr: 121032 118756 118141 120411 109475 104804 | 478340 214279 692619
fast_wr: 0 0 0 0 125829 124251 | 0 250080 250080
local_csr_rd, local_csr_wr: 0 0 0 0 0 0 | 0 0 0
r_fifo_rd: 2166 2354 2356 2360 0 0 | 9236 0 9236
pci_dma: 0 0 0 0 0 0 | 0 0 0
scratch: 884 784 784 402 22088 21985 | 2854 44073 46927
sdram: 16348 16435 16455 16659 62921 62128 | 65897 125049 190946
sram: 49762 52949 53221 50683 7341 7789 | 206615 15130 221745
t_fifo_wr: 0 0 0 0 73268 73486 | 0 146754 146754
Sub Total: 190192 191278 190957 190515 400922 394443 | 762942 795365 1558307
Percentage: Rx 5.3%, Tx 8.5%, Overall 6.5%

Local Register Instructions
find_bset, find_bset_with_mask: 0 0 0 0 0 0 | 0 0 0
immed: 228378 224671 223414 227183 35737 37709 | 903646 73446 977092
immed_b0, immed_b1, immed_b2, immed_b3: 0 0 0 0 0 0 | 0 0 0
immed_w0, immed_w1: 3 4 4 4 0 0 | 15 0 15
ld_field, ld_field_w_clr: 44599 30559 30590 31575 5876 6152 | 137323 12028 149351
load_addr: 0 0 0 0 0 0 | 0 0 0
load_bset_result1, load_bset_result2: 0 0 0 0 0 0 | 0 0 0
Sub Total: 272980 255234 254008 258762 41613 43861 | 1040984 85474 1126458
Percentage: Rx 7.2%, Tx 0.9%, Overall 4.7%

Miscellaneous Instructions
ctx_arb: 597336 596105 596452 596067 288190 285074 | 2385960 573264 2959224
nop: 0 0 0 0 114081 120426 | 0 234507 234507
hash1_48, hash2_48, hash3_48: 788 785 785 787 0 0 | 3145 0 3145
hash1_64, hash2_64, hash3_64: 0 0 0 0 0 0 | 0 0 0
Sub Total: 598124 596890 597237 596854 402271 405500 | 2389105 807771 3196876
Percentage: Rx 16.4%, Tx 8.6%, Overall 13.4%

TOTAL: 3638115 3629698 3628622 3635329 4690059 4686652 | 14531764 9376711 23908475
Table C-4. Instruction Mix Data for mixed packet sizes

Instruction: uEngine0 uEngine1 uEngine2 uEngine3 uEngine4 uEngine5 | Rx(0,1,2,3) Tx(4,5) Overall

Arithmetic, Rotate, and Shift Instructions
alu: 211184 213119 211770 212799 222931 224453 | 848872 447384 1296256
alu_shf: 163693 162612 162362 161280 545913 534684 | 649947 1080597 1730544
dbl_shf: 0 0 0 0 0 0 | 0 0 0
Sub Total: 374877 375731 374132 374079 768844 759137 | 1498819 1527981 3026800
Percentage: Rx 31.9%, Tx 50.7%, Overall 39.2%

Branch and Jump Instructions
br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout: 447651 453420 454827 454564 380178 384693 | 1810462 764871 2575333
br_bset, br_bclr: 18291 14090 13762 14011 0 0 | 60154 0 60154
br=byte, br!=byte: 0 0 0 0 0 0 | 0 0 0
br=ctx, br!=ctx: 17 13 13 13 61692 61699 | 56 123391 123447
br_inp_state: 0 0 0 0 0 0 | 0 0 0
br_!signal: 339 339 340 340 0 0 | 1358 0 1358
jump: 0 0 0 0 23872 22618 | 0 46490 46490
rtn: 0 0 0 0 0 0 | 0 0 0
Sub Total: 466298 467862 468942 468928 465742 469010 | 1872030 934752 2806782
Percentage: Rx 39.8%, Tx 31.0%, Overall 36.4%

Reference Instructions
csr: 31180 27100 29915 28126 34663 35731 | 116321 70394 186715
fast_wr: 0 0 0 0 37005 37013 | 0 74018 74018
local_csr_rd, local_csr_wr: 0 0 0 0 0 0 | 0 0 0
r_fifo_rd: 2110 2320 2072 2260 0 0 | 8762 0 8762
pci_dma: 0 0 0 0 0 0 | 0 0 0
scratch: 378 503 390 380 6866 6986 | 1651 13852 15503
sdram: 6963 6837 6688 6396 18506 18509 | 26884 37015 63899
sram: 28808 32195 29592 30886 7902 7238 | 121481 15140 136621
t_fifo_wr: 0 0 0 0 22801 22360 | 0 45161 45161
Sub Total: 69439 68955 68657 68048 127743 127837 | 275099 255580 530679
Percentage: Rx 5.8%, Tx 8.5%, Overall 6.9%

Local Register Instructions
find_bset, find_bset_with_mask: 0 0 0 0 0 0 | 0 0 0
immed: 66852 60814 64440 61808 11419 11011 | 253914 22430 276344
immed_b0, immed_b1, immed_b2, immed_b3: 0 0 0 0 0 0 | 0 0 0
immed_w0, immed_w1: 3 4 4 4 0 0 | 15 0 15
ld_field, ld_field_w_clr: 25513 21364 19778 20845 6325 5712 | 87500 12037 99537
load_addr: 0 0 0 0 0 0 | 0 0 0
load_bset_result1, load_bset_result2: 0 0 0 0 0 0 | 0 0 0
Sub Total: 92368 82182 84222 82657 17744 16723 | 341429 34467 375896
Percentage: Rx 7.3%, Tx 1.1%, Overall 4.9%

Miscellaneous Instructions
ctx_arb: 177274 177985 179111 178960 89367 90537 | 713330 179904 893234
nop: 0 0 0 0 42096 39693 | 0 81789 81789
hash1_48, hash2_48, hash3_48: 751 825 736 801 0 0 | 3113 0 3113
hash1_64, hash2_64, hash3_64: 0 0 0 0 0 0 | 0 0 0
Sub Total: 178025 178810 179847 179761 131463 130230 | 716443 261693 978136
Percentage: Rx 15.2%, Tx 8.7%, Overall 12.7%

TOTAL: 1181007 1173540 1175800 1173473 1511536 1502937 | 4703820 3014473 7718293
Table C-5. Memory Accesses per Cycle

Reference Instruction: uEngine0 uEngine1 uEngine2 uEngine3 uEngine4 uEngine5 | Average

64B packets
csr (Csr reference): 0.013912 0.012231 0.012187 0.0123682 0.013649 0.01371 | 0.01301
fast_wr (Write immediate data to thd_done csrs): 0 0 0 0 0.017821 0.017824 | 0.017823
r_fifo_rd (Read the receive fifo): 0.00625 0.006822 0.006822 0.0068158 0 0 | 0.006677
scratch (Scratchpad reference): 0.002541 0.002252 0.002165 0.0014589 0.007017 0.006766 | 0.0037
sdram (Sdram reference): 0.00709 0.007948 0.007945 0.0079391 0.008925 0.008919 | 0.008128
sram (Sram reference): 0.021285 0.023847 0.023928 0.0246314 0.022612 0.021477 | 0.022963
t_fifo_wr (Write to the transmit fifo): 0 0 0 0 0.009007 0.009027 | 0.009017
Total: 0.051078 0.0531 0.053047 0.0532134 0.079032 0.077725 | 0.061199

594B packets
csr: 0.0255 0.025448 0.025213 0.0251818 0.017093 0.017168 | 0.022601
fast_wr: 0 0 0 0 0.018286 0.01813 | 0.018208
r_fifo_rd: 0.000861 0.000939 0.000939 0.0009394 0 0 | 0.00092
scratch: 0.000351 0.000312 0.000292 0.0001827 0.003598 0.003557 | 0.001382
sdram: 0.003793 0.003911 0.003911 0.0039111 0.009145 0.009066 | 0.005623
sram: 0.002929 0.003282 0.003301 0.0034104 0.003115 0.002956 | 0.003166
t_fifo_wr: 0 0 0 0 0.010835 0.010782 | 0.010809
Total: 0.033435 0.033891 0.033657 0.0336254 0.062072 0.061659 | 0.043056

1518B packets
csr: 0.019214 0.018853 0.018755 0.0191157 0.01738 0.016638 | 0.018326
fast_wr: 0 0 0 0 0.019976 0.019725 | 0.019851
r_fifo_rd: 0.000344 0.000374 0.000374 0.0003747 0 0 | 0.000367
scratch: 0.00014 0.000124 0.000124 6.382E-05 0.003507 0.00349 | 0.001242
sdram: 0.002595 0.002609 0.002612 0.0026447 0.009989 0.009863 | 0.005052
sram: 0.0079 0.008406 0.008449 0.0080461 0.001165 0.001237 | 0.005867
t_fifo_wr: 0 0 0 0 0.011632 0.011666 | 0.011649
Total: 0.030194 0.030366 0.030315 0.0302449 0.063648 0.062619 | 0.041231

MIX packets
csr: 0.015459 0.013436 0.014832 0.0139448 0.017186 0.017715 | 0.015429
fast_wr: 0 0 0 0 0.018347 0.018351 | 0.018349
r_fifo_rd: 0.001046 0.00115 0.001027 0.0011205 0 0 | 0.001086
scratch: 0.000187 0.000249 0.000193 0.0001884 0.003404 0.003464 | 0.001281
sdram: 0.003452 0.00339 0.003316 0.0031711 0.009175 0.009177 | 0.00528
sram: 0.014283 0.015962 0.014672 0.0153132 0.003918 0.003589 | 0.011289
t_fifo_wr: 0 0 0 0 0.011305 0.011086 | 0.011195
Total: 0.034428 0.034188 0.03404 0.033738 0.063335 0.063381 | 0.043851
Appendix D: Latency
[Plot: cumulative percentage (0-100%) of latency samples vs. cycles (15-105), one curve per microengine, Microengine0-Microengine3]
Figure D-1. Receive FIFO buffer Latency
[Plot: cumulative percentage (0-100%) of latency samples vs. cycles (0-100), one curve each for Microengine4 and Microengine5]
Figure D-2. Scratchpad RAM Latency
[Plot: cumulative percentage (0-100%) of latency samples vs. cycles (0-140), one curve per microengine, Microengine0-Microengine5]
Figure D-3. FBI CSR Latency
[Plot: cumulative percentage (0-100%) of latency samples vs. cycles (30-80), one curve per microengine, Microengine0-Microengine3]
Figure D-4. Hash unit Latency
Table D-1. SDRAM Latency Data (1)
cycles #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative43 201 4.3 4.3 234 4.7 4.7 288 4.6 4.6 212 4.2 4.244 204 4.4 8.7 194 3.9 8.6 171 3.4 8 184 3.7 7.945 112 2.4 11 121 2.4 11 143 2.9 10.8 132 2.6 10.646 82 1.8 12.8 88 1.8 12.8 104 2.1 12.9 92 1.8 12.447 54 1.2 13.9 46 0.9 13.7 56 1.1 14.1 70 1.4 13.848 34 0.7 14.7 31 0.6 14.3 35 0.7 14.8 55 1.1 14.949 51 1.1 15.8 62 1.2 15.5 76 1.5 16.3 68 1.4 16.350 43 0.9 16.7 47 0.9 16.5 42 0.8 17.1 46 0.9 17.251 57 1.2 17.9 76 1.5 18 70 1.4 18.5 71 1.4 18.652 41 0.9 18.8 37 0.7 18.7 61 1.2 19.7 45 0.9 19.553 71 1.5 20.3 83 1.7 20.4 84 1.7 21.4 79 1.6 21.154 56 1.2 21.5 52 1 21.4 64 1.3 22.7 52 1 22.155 60 1.3 22.8 79 1.6 23 44 0.9 23.6 81 1.6 23.856 52 1.1 23.9 52 1 24.1 50 1 24.6 51 1 24.857 58 1.2 25.1 67 1.3 25.4 84 1.7 26.3 73 1.5 26.258 44 0.9 26.1 76 1.5 26.9 54 1.1 27.3 56 1.1 27.459 55 1.2 27.2 83 1.7 28.6 76 1.5 28.9 87 1.7 29.160 52 1.1 28.3 42 0.8 29.4 54 1.1 29.9 57 1.1 30.361 67 1.4 29.8 80 1.6 31 69 1.4 31.3 80 1.6 31.962 49 1 30.8 47 0.9 32 50 1 32.3 42 0.8 32.763 66 1.4 32.2 83 1.7 33.6 67 1.3 33.7 83 1.7 34.464 52 1.1 33.3 72 1.4 35.1 65 1.3 35 61 1.2 35.665 89 1.9 35.2 90 1.8 36.9 88 1.8 36.7 85 1.7 37.366 53 1.1 36.4 64 1.3 38.2 66 1.3 38.1 57 1.1 38.467 90 1.9 38.3 79 1.6 39.7 98 2 40 80 1.6 4068 63 1.3 39.6 68 1.4 41.1 59 1.2 41.2 60 1.2 41.269 77 1.6 41.3 84 1.7 42.8 98 2 43.2 94 1.9 43.170 61 1.3 42.6 72 1.4 44.2 61 1.2 44.4 51 1 44.171 66 1.4 44 76 1.5 45.7 60 1.2 45.6 91 1.8 45.972 57 1.2 45.4 57 1.1 46.9 43 0.9 46.4 51 1 4773 88 1.9 47.1 77 1.5 48.4 70 1.4 47.8 81 1.6 48.674 59 1.3 48.4 54 1.1 49.5 55 1.1 48.9 58 1.2 49.775 100 2.1 50.5 86 1.7 51.2 67 1.3 50.3 78 1.6 51.376 49 1 51.5 54 1.1 52.3 56 1.1 51.4 50 1 52.377 68 1.5 53 80 1.6 53.9 89 1.8 53.2 87 1.7 54.178 56 1.2 54.2 64 1.3 55.2 40 0.8 54 51 1 55.179 83 1.8 56 91 1.8 57 85 1.7 55.7 72 1.4 56.580 52 1.1 57.1 65 1.3 58.3 54 1.1 56.8 60 1.2 57.781 72 1.5 58.6 95 1.9 60.2 73 1.5 58.2 82 1.6 59.482 53 1.1 59.7 
53 1.1 61.3 62 1.2 59.5 50 1 60.483 70 1.5 61.2 67 1.3 62.6 87 1.7 61.2 78 1.6 61.984 46 1 62.2 64 1.3 63.9 45 0.9 62.1 56 1.1 6385 83 1.8 64 56 1.1 65 63 1.3 63.4 78 1.6 64.686 36 0.8 64.8 54 1.1 66.1 53 1.1 64.4 55 1.1 65.787 62 1.3 66.1 68 1.4 67.5 73 1.5 65.9 69 1.4 67.188 49 1 67.1 42 0.8 68.3 54 1.1 67 43 0.9 67.989 63 1.3 68.5 70 1.4 69.7 66 1.3 68.3 73 1.5 69.490 44 0.9 69.4 41 0.8 70.5 51 1 69.3 42 0.8 70.391 64 1.4 70.8 45 0.9 71.4 55 1.1 70.4 73 1.5 71.792 35 0.7 71.5 46 0.9 72.3 49 1 71.4 30 0.6 72.393 54 1.2 72.7 58 1.2 73.5 67 1.3 72.7 53 1.1 73.494 38 0.8 73.5 45 0.9 74.4 46 0.9 73.7 37 0.7 74.195 63 1.3 74.8 63 1.3 75.7 48 1 74.6 55 1.1 75.2
Microengine0 Microengine1 Microengine2 Microengine3
Table D-1. SDRAM Latency Data (2)
cycles #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative96 35 0.7 75.6 31 0.6 76.3 43 0.9 75.5 34 0.7 75.997 50 1.1 76.7 46 0.9 77.2 39 0.8 76.3 59 1.2 77.198 35 0.7 77.4 38 0.8 78 38 0.8 77 43 0.9 77.999 43 0.9 78.3 42 0.8 78.8 52 1 78.1 48 1 78.9
100 44 0.9 79.3 25 0.5 79.3 26 0.5 78.6 37 0.7 79.6101 51 1.1 80.4 51 1 80.3 52 1 79.6 49 1 80.6102 25 0.5 80.9 32 0.6 81 38 0.8 80.4 24 0.5 81.1103 42 0.9 81.8 44 0.9 81.8 40 0.8 81.2 29 0.6 81.7104 26 0.6 82.3 36 0.7 82.6 25 0.5 81.7 32 0.6 82.3105 26 0.6 82.9 45 0.9 83.5 39 0.8 82.5 39 0.8 83.1106 20 0.4 83.3 26 0.5 84 31 0.6 83.1 37 0.7 83.8107 35 0.7 84.1 39 0.8 84.8 40 0.8 83.9 38 0.8 84.6108 26 0.6 84.6 31 0.6 85.4 24 0.5 84.4 27 0.5 85.1109 31 0.7 85.3 33 0.7 86 40 0.8 85.2 39 0.8 85.9110 20 0.4 85.7 28 0.6 86.6 28 0.6 85.7 32 0.6 86.6111 33 0.7 86.4 25 0.5 87.1 32 0.6 86.4 28 0.6 87.1112 24 0.5 86.9 27 0.5 87.7 18 0.4 86.7 26 0.5 87.6113 27 0.6 87.5 30 0.6 88.3 32 0.6 87.4 26 0.5 88.2114 13 0.3 87.8 20 0.4 88.7 16 0.3 87.7 26 0.5 88.7115 28 0.6 88.4 31 0.6 89.3 42 0.8 88.5 31 0.6 89.3116 18 0.4 88.8 22 0.4 89.7 20 0.4 88.9 21 0.4 89.7117 33 0.7 89.5 29 0.6 90.3 24 0.5 89.4 27 0.5 90.3118 20 0.4 89.9 15 0.3 90.6 23 0.5 89.9 15 0.3 90.6119 22 0.5 90.4 34 0.7 91.3 24 0.5 90.4 27 0.5 91.1120 18 0.4 90.8 18 0.4 91.6 13 0.3 90.6 20 0.4 91.5121 18 0.4 91.1 21 0.4 92.1 31 0.6 91.2 27 0.5 92.1122 19 0.4 91.5 16 0.3 92.4 19 0.4 91.6 12 0.2 92.3123 32 0.7 92.2 24 0.5 92.9 30 0.6 92.2 19 0.4 92.7124 17 0.4 92.6 11 0.2 93.1 15 0.3 92.5 13 0.3 92.9125 24 0.5 93.1 21 0.4 93.5 23 0.5 93 18 0.4 93.3126 11 0.2 93.3 15 0.3 93.8 13 0.3 93.2 21 0.4 93.7127 24 0.5 93.8 28 0.6 94.4 26 0.5 93.8 24 0.5 94.2128 8 0.2 94 9 0.2 94.5 6 0.1 93.9 11 0.2 94.4129 20 0.4 94.4 19 0.4 94.9 18 0.4 94.2 12 0.2 94.7130 11 0.2 94.7 14 0.3 95.2 8 0.2 94.4 10 0.2 94.9131 14 0.3 95 16 0.3 95.5 20 0.4 94.8 14 0.3 95.1132 11 0.2 95.2 8 0.2 95.7 12 0.2 95 6 0.1 95.3133 14 0.3 95.5 10 0.2 95.9 19 0.4 95.4 16 0.3 95.6134 14 0.3 95.8 6 0.1 96 9 0.2 95.6 13 0.3 95.8135 8 0.2 96 15 0.3 96.3 17 0.3 95.9 12 0.2 96.1136 6 0.1 96.1 6 0.1 96.4 6 0.1 96.1 10 0.2 96.3137 8 0.2 96.3 18 0.4 96.8 13 0.3 96.3 13 0.3 96.5138 12 0.3 96.5 9 0.2 97 8 0.2 96.5 6 0.1 96.7139 7 0.1 96.7 11 0.2 97.2 13 0.3 96.7 8 0.2 96.8140 
9 0.2 96.9 2 0 97.2 5 0.1 96.8 10 0.2 97141 3 0.1 96.9 10 0.2 97.4 10 0.2 97 13 0.3 97.3142 8 0.2 97.1 4 0.1 97.5 5 0.1 97.1 10 0.2 97.5143 8 0.2 97.3 5 0.1 97.6 7 0.1 97.3 6 0.1 97.6144 10 0.2 97.5 5 0.1 97.7 3 0.1 97.3 5 0.1 97.7145 4 0.1 97.6 10 0.2 97.9 8 0.2 97.5 5 0.1 97.8146 11 0.2 97.8 5 0.1 98 8 0.2 97.7 5 0.1 97.9147 5 0.1 97.9 7 0.1 98.1 12 0.2 97.9 5 0.1 98148 4 0.1 98 4 0.1 98.2 4 0.1 98 3 0.1 98.1
Microengine0 Microengine1 Microengine2 Microengine3
Table D-1. SDRAM Latency Data (3)
cycles #ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative149 4 0.1 98.1 8 0.2 98.4 9 0.2 98.2 4 0.1 98.1150 10 0.2 98.3 3 0.1 98.4 1 0 98.2 5 0.1 98.2151 1 0 98.3 5 0.1 98.5 4 0.1 98.3 7 0.1 98.4152 7 0.1 98.5 4 0.1 98.6 5 0.1 98.4 1 0 98.4153 2 0 98.5 3 0.1 98.7 3 0.1 98.4 6 0.1 98.5154 2 0 98.5 4 0.1 98.8 2 0 98.5 1 0 98.5155 6 0.1 98.7 3 0.1 98.8 9 0.2 98.6 5 0.1 98.6156 2 0 98.7 2 0 98.9 3 0.1 98.7 4 0.1 98.7157 4 0.1 98.8 6 0.1 99 5 0.1 98.8 5 0.1 98.8158 2 0 98.8 0 0 99 4 0.1 98.9 1 0 98.8159 5 0.1 98.9 1 0 99 3 0.1 98.9 2 0 98.9160 4 0.1 99 1 0 99 3 0.1 99 3 0.1 98.9161 5 0.1 99.1 3 0.1 99.1 5 0.1 99.1 5 0.1 99162 2 0 99.2 1 0 99.1 4 0.1 99.2 3 0.1 99.1163 1 0 99.2 5 0.1 99.2 3 0.1 99.2 3 0.1 99.2164 2 0 99.2 2 0 99.2 1 0 99.3 4 0.1 99.2165 1 0 99.3 4 0.1 99.3 7 0.1 99.4 4 0.1 99.3166 1 0 99.3 2 0 99.4 0 0 99.4 4 0.1 99.4167 3 0.1 99.3 3 0.1 99.4 1 0 99.4 2 0 99.4168 1 0 99.4 1 0 99.4 1 0 99.4 3 0.1 99.5169 3 0.1 99.4 2 0 99.5 6 0.1 99.6 1 0 99.5170 2 0 99.5 0 0 99.5 0 0 99.6 0 0 99.5171 3 0.1 99.5 2 0 99.5 4 0.1 99.6 4 0.1 99.6172 1 0 99.6 1 0 99.5 0 0 99.6 0 0 99.6173 2 0 99.6 1 0 99.6 2 0 99.7 1 0 99.6174 2 0 99.6 1 0 99.6 1 0 99.7 0 0 99.6175 1 0 99.7 2 0 99.6 2 0 99.7 0 0 99.6176 3 0.1 99.7 1 0 99.6 0 0 99.7 1 0 99.6177 0 0 99.7 2 0 99.7 0 0 99.7 1 0 99.7178 0 0 99.7 1 0 99.7 0 0 99.7 2 0 99.7179 0 0 99.7 1 0 99.7 0 0 99.7 3 0.1 99.8180 0 0 99.7 0 0 99.7 0 0 99.7 1 0 99.8181 0 0 99.7 0 0 99.7 1 0 99.8 0 0 99.8182 0 0 99.7 0 0 99.7 1 0 99.8 0 0 99.8183 0 0 99.7 2 0 99.8 0 0 99.8 2 0 99.8184 0 0 99.7 0 0 99.8 1 0 99.8 0 0 99.8185 1 0 99.7 0 0 99.8 1 0 99.8 0 0 99.8186 1 0 99.8 0 0 99.8 1 0 99.8 0 0 99.8187 1 0 99.8 1 0 99.8 1 0 99.9 0 0 99.8188 0 0 99.8 2 0 99.8 2 0 99.9 0 0 99.8189 1 0 99.8 0 0 99.8 1 0 99.9 1 0 99.8190 0 0 99.8 0 0 99.8 0 0 99.9 0 0 99.8191 1 0 99.8 1 0 99.8 2 0 100 0 0 99.8192 0 0 99.8 0 0 99.8 0 0 100 0 0 99.8193 3 0.1 99.9 2 0 99.9 0 0 100 1 0 99.9194 2 0 99.9 1 0 99.9 0 0 100 1 0 99.9195 0 0 99.9 1 0 
99.9 1 0 100 1 0 99.9196 1 0 100 1 0 99.9 0 0 100 2 0 99.9197 1 0 100 0 0 99.9 0 0 100 0 0 99.9198 0 0 100 0 0 99.9 0 0 100 0 0 99.9199 0 0 100 0 0 99.9 1 0 100 0 0 99.9200 0 0 100 0 0 99.9 0 0 99.9201 0 0 100 0 0 99.9 0 0 99.9
Microengine0 Microengine1 Microengine2 Microengine3
Table D-1. SDRAM Latency Data (4)
cycles #ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative202 0 0 100 0 0 99.9 1 0 100203 0 0 100 0 0 99.9 0 0 100204 0 0 100 0 0 99.9 0 0 100205 0 0 100 0 0 99.9 0 0 100206 0 0 100 0 0 99.9 0 0 100207 0 0 100 0 0 99.9 0 0 100208 0 0 100 0 0 99.9 0 0 100209 0 0 100 0 0 99.9 0 0 100210 0 0 100 0 0 99.9 0 0 100211 1 0 100 0 0 99.9 0 0 100212 0 0 99.9 0 0 100213 0 0 99.9 2 0 100214 1 0 100215 0 0 100216 0 0 100217 0 0 100218 0 0 100219 1 0 100220 1 0 100
Microengine0 Microengine1 Microengine2 Microengine3
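The % and Cumulative columns of Table D-1 (and of Tables D-2 through D-4) follow directly from the raw sample counts: each cycle bin's percentage is its #ofSmpl over the total number of samples, and the cumulative column is the running sum of those percentages. A small Python sketch of that derivation; the three-bin histogram here is hypothetical (the real tables sum over every cycle bin, so the actual percentages differ):

```python
# Sketch: deriving the "%" and "Cumulative" columns of the latency tables
# from raw per-cycle sample counts (#ofSmpl). The three bins below are
# hypothetical illustration data, not the full table, so the percentages
# do not match the table rows.

def distribution(hist):
    """hist: list of (cycles, n_samples) -> list of (cycles, pct, cum_pct)."""
    total = sum(n for _, n in hist)
    rows, cum = [], 0.0
    for cycles, n in hist:
        pct = 100.0 * n / total
        cum += pct
        rows.append((cycles, round(pct, 1), round(cum, 1)))
    return rows

for row in distribution([(43, 201), (44, 204), (45, 112)]):
    print(row)
```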
Table D-2. SRAM Latency (unlocked) Data (1)
cycles #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative16 1543 10.5 10.5 1774 10.9 10.9 1842 11.3 11.3 1719 10.6 10.617 1091 7.4 17.9 1309 8.1 19 1319 8.1 19.5 1343 8.3 18.918 594 4 22 668 4.1 23.1 674 4.1 23.6 638 3.9 22.8 818 16.4 16.4 774 16.6 16.619 413 2.8 24.8 481 3 26.1 463 2.9 26.5 477 2.9 25.7 961 19.3 35.7 878 18.8 35.320 1078 7.3 32.2 1095 6.7 32.8 1115 6.9 33.3 1102 6.8 32.5 441 8.8 44.5 349 7.5 42.821 972 6.6 38.8 1014 6.2 39 1065 6.6 39.9 1011 6.2 38.7 343 6.9 51.4 330 7.1 49.922 645 4.4 43.2 795 4.9 43.9 767 4.7 44.6 745 4.6 43.3 332 6.7 58 282 6 55.923 530 3.6 46.8 620 3.8 47.8 593 3.7 48.3 640 3.9 47.3 320 6.4 64.5 278 5.9 61.924 580 4 50.7 666 4.1 51.9 644 4 52.2 686 4.2 51.5 332 6.7 71.1 322 6.9 68.725 481 3.3 54 545 3.4 55.2 529 3.3 55.5 547 3.4 54.9 310 6.2 77.3 336 7.2 75.926 563 3.8 57.8 552 3.4 58.6 546 3.4 58.8 570 3.5 58.4 188 3.8 81.1 220 4.7 80.627 464 3.2 61 488 3 61.6 473 2.9 61.8 515 3.2 61.5 226 4.5 85.6 222 4.7 85.428 455 3.1 64.1 467 2.9 64.5 499 3.1 64.8 471 2.9 64.4 145 2.9 88.5 141 3 88.429 376 2.6 66.7 374 2.3 66.8 391 2.4 67.2 386 2.4 66.8 130 2.6 91.1 128 2.7 91.130 349 2.4 69.1 385 2.4 69.2 389 2.4 69.6 421 2.6 69.4 77 1.5 92.7 75 16 92.731 301 2.1 71.1 323 2 71.1 342 2.1 71.7 330 2 71.4 65 1.3 94 59 1.3 9432 310 2.1 73.2 306 1.9 73 328 2 73.8 343 2.1 73.6 43 0.9 94.8 47 1 9533 218 1.5 74.7 259 1.6 74.6 260 1.6 75.4 271 1.7 75.2 36 0.7 95.6 35 0.7 95.834 230 1.6 76.3 268 1.7 76.3 261 1.6 77 277 1.7 76.9 26 0.5 96.1 22 0.5 96.235 226 1.5 77.8 238 1.5 77.7 200 1.2 78.2 232 1.4 78.4 26 0.5 96.6 22 0.5 96.736 194 1.3 79.1 236 1.5 79.2 236 1.5 79.7 216 1.3 79.7 20 0.4 97 15 0.3 9737 152 1 80.2 170 1 80.2 162 1 80.6 194 1.2 80.9 12 0.2 97.3 23 0.5 97.538 184 1.3 81.4 223 1.4 81.6 184 1.1 81.8 198 1.2 82.1 20 0.4 97.7 8 0.2 97.739 150 1 82.4 149 0.9 82.5 122 0.8 82.5 170 1 83.2 15 0.3 98 15 0.3 9840 162 1.1 83.5 178 1.1 83.6 133 0.8 83.4 146 0.9 84 8 0.2 
98.1 8 0.2 98.241 120 0.8 84.4 125 0.8 84.4 145 0.9 84.2 137 0.8 84.9 9 0.2 98.3 6 0.1 98.342 149 1 85.4 139 0.9 85.3 149 0.9 85.2 132 0.8 85.7 9 0.2 98.5 15 0.3 98.643 90 0.6 86 111 0.7 85.9 126 0.8 85.9 106 0.7 86.4 8 0.2 98.6 3 0.1 98.744 132 0.9 86.9 115 0.7 86.6 135 0.8 86.8 118 0.7 87.1 3 0.1 98.7 5 0.1 98.845 106 0.7 87.6 101 0.6 87.3 93 0.6 87.3 109 0.7 87.8 4 0.1 98.8 4 0.1 98.946 114 0.8 88.4 115 0.7 88 94 0.6 87.9 113 0.7 88.5 3 0.1 98.8 5 0.1 9947 67 0.5 88.8 95 0.6 88.6 84 0.5 88.4 97 0.6 89.1 3 0.1 98.9 3 0.1 99.148 97 0.7 89.5 85 0.5 89.1 88 0.5 89 87 0.5 89.6 3 0.1 99 7 0.1 99.249 80 0.5 90.1 84 0.5 89.6 85 0.5 89.5 76 0.5 90.1 4 0.1 99 3 0.1 99.350 86 0.6 90.6 106 0.7 90.3 96 0.6 90.1 97 0.6 90.7 4 0.1 99.1 2 0 99.351 72 0.5 91.1 67 0.4 90.7 65 0.4 90.5 72 0.4 91.1 4 0.1 99.2 2 0 99.452 54 0.4 91.5 70 0.4 91.1 80 0.5 91 77 0.5 91.6 9 0.2 99.4 6 0.1 99.553 62 0.4 91.9 76 0.5 91.6 74 0.5 91.4 72 0.4 92 4 0.1 99.5 1 0 99.554 63 0.4 92.3 85 0.5 92.1 78 0.5 91.9 79 0.5 92.5 3 0.1 99.5 3 0.1 99.655 55 0.4 92.7 66 0.4 92.5 53 0.3 92.2 47 0.3 92.8 6 0.1 99.6 1 0 99.656 71 0.5 93.2 70 0.4 92.9 75 0.5 92.7 84 0.5 93.3 1 0 99.7 4 0.1 99.757 44 0.3 93.5 58 0.4 93.3 51 0.3 93 52 0.3 93.6 2 0 99.7 2 0 99.758 53 0.4 93.9 51 0.3 93.6 45 0.3 93.3 49 0.3 93.9 1 0 99.7 1 0 99.759 31 0.2 94.1 53 0.3 93.9 56 0.3 93.6 35 0.2 94.1 1 0 99.7 2 0 99.860 45 0.3 94.4 62 0.4 94.3 57 0.4 94 46 0.3 94.4 1 0 99.8 3 0.1 99.961 35 0.2 94.6 42 0.3 94.6 49 0.3 94.3 31 0.2 94.6 1 0 99.8 1 0 99.962 54 0.4 95 45 0.3 94.8 46 0.3 94.6 34 0.2 94.8 0 0 99.8 1 0 99.963 28 0.2 95.2 33 0.2 95 36 0.2 94.8 32 0.2 95 3 0.1 99.8 0 0 99.964 36 0.2 95.4 39 0.2 95.3 31 0.2 95 36 0.2 95.2 1 0 99.9 1 0 99.965 32 0.2 95.6 24 0.1 95.4 35 0.2 95.2 31 0.2 95.4 1 0 99.9 0 0 99.966 31 0.2 95.9 33 0.2 95.6 32 0.2 95.4 37 0.2 95.7 1 0 99.9 2 0 10067 19 0.1 96 25 0.2 95.8 32 0.2 95.6 25 0.2 95.8 0 0 99.9 0 0 10068 33 0.2 96.2 35 0.2 96 33 0.2 95.8 26 0.2 96 1 0 99.9 0 0 10069 26 0.2 96.4 33 0.2 96.2 23 0.1 95.9 
22 0.1 96.1 0 0 99.9 0 0 10070 24 0.2 96.6 26 0.2 96.4 20 0.1 96.1 31 0.2 96.3 0 0 99.9 0 0 10071 20 0.1 96.7 19 0.1 96.5 31 0.2 96.3 19 0.1 96.4 2 0 100 0 0 10072 20 0.1 96.8 27 0.2 96.7 38 0.2 96.5 35 0.2 96.6 1 0 100 1 0 10073 12 0.1 96.9 22 0.1 96.8 24 0.1 96.6 18 0.1 96.7 0 0 100 1 0 10074 22 0.1 97.1 22 0.1 96.9 27 0.2 96.8 34 0.2 97 0 0 10075 13 0.1 97.1 12 0.1 97 14 0.1 96.9 27 0.2 97.1 0 0 10076 24 0.2 97.3 19 0.1 97.1 27 0.2 97.1 18 0.1 97.2 0 0 10077 18 0.1 97.4 17 0.1 97.2 23 0.1 97.2 20 0.1 97.4 0 0 10078 16 0.1 97.5 18 0.1 97.3 22 0.1 97.3 21 0.1 97.5 0 0 10079 9 0.1 97.6 18 0.1 97.4 29 0.2 97.5 10 0.1 97.5 1 0 10080 18 0.1 97.7 15 0.1 97.5 23 0.1 97.7 13 0.1 97.681 19 0.1 97.9 18 0.1 97.6 12 0.1 97.7 17 0.1 97.782 16 0.1 98 23 0.1 97.8 13 0.1 97.8 12 0.1 97.883 22 0.1 98.1 9 0.1 97.8 13 0.1 97.9 12 0.1 97.984 13 0.1 98.2 17 0.1 97.9 17 0.1 98 18 0.1 9885 6 0 98.2 18 0.1 98.1 13 0.1 98.1 13 0.1 98.186 13 0.1 98.3 14 0.1 98.1 12 0.1 98.2 17 0.1 98.287 7 0 98.4 16 0.1 98.2 6 0 98.2 17 0.1 98.3
Microengine0 Microengine1 Microengine2 Microengine3 Microengine4 Microengine5
Table D-2. SRAM Latency (unlocked) Data (2)
cycles #ofSmpl % Cumulative #ofSmpl % Cumulative #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative88 15 0.1 98.5 11 0.1 98.3 10 0.1 98.3 16 0.1 98.489 10 0.1 98.5 8 0 98.4 13 0.1 98.3 8 0 98.490 12 0.1 98.6 8 0 98.4 17 0.1 98.4 13 0.1 98.591 6 0 98.7 12 0.1 98.5 11 0.1 98.5 5 0 98.592 4 0 98.7 5 0 98.5 8 0 98.6 12 0.1 98.693 3 0 98.7 12 0.1 98.6 11 0.1 98.6 10 0.1 98.794 10 0.1 98.8 6 0 98.6 15 0.1 98.7 13 0.1 98.895 8 0.1 98.8 10 0.1 98.7 12 0.1 98.8 7 0 98.896 8 0.1 98.9 14 0.1 98.8 1 0 98.8 5 0 98.897 6 0 98.9 7 0 98.8 8 0 98.8 14 0.1 98.998 3 0 99 9 0.1 98.9 5 0 98.9 7 0 9999 2 0 99 3 0 98.9 4 0 98.9 10 0.1 99
100 5 0 99 10 0.1 98.9 8 0 98.9 4 0 99101 6 0 99 10 0.1 99 1 0 99 7 0 99.1102 5 0 99.1 0 0 99 11 0.1 99 7 0 99.1103 4 0 99.1 7 0 99.1 8 0 99.1 5 0 99.2104 4 0 99.1 3 0 99.1 5 0 99.1 2 0 99.2105 6 0 99.2 5 0 99.1 5 0 99.1 5 0 99.2106 6 0 99.2 1 0 99.1 8 0 99.2 6 0 99.2107 3 0 99.2 4 0 99.1 5 0 99.2 4 0 99.3108 8 0.1 99.3 5 0 99.2 8 0 99.3 6 0 99.3109 2 0 99.3 6 0 99.2 5 0 99.3 2 0 99.3110 0 0 99.3 8 0 99.2 6 0 99.3 2 0 99.3111 4 0 99.3 8 0 99.3 2 0 99.3 1 0 99.3112 2 0 99.3 9 0 99.4 5 0 99.4 8 0 99.4113 4 0 99.4 0 0 99.4 4 0 99.4 2 0 99.4114 4 0 99.4 4 0 99.4 6 0 99.4 4 0 99.4115 4 0 99.4 2 0 99.4 5 0 99.5 11 0.1 99.5116 3 0 99.4 4 0 99.4 8 0 99.5 2 0 99.5117 3 0 99.5 2 0 99.4 4 0 99.5 4 0 99.5118 4 0 99.5 4 0 99.5 4 0 99.6 2 0 99.5119 2 0 99.5 4 0 99.5 4 0 99.6 2 0 99.6120 2 0 99.5 1 0 99.5 7 0 99.6 4 0 99.6121 7 0 99.6 5 0 99.5 2 0 99.6 4 0 99.6122 1 0 99.6 4 0 99.5 1 0 99.6 5 0 99.6123 1 0 99.6 2 0 99.6 4 0 99.7 4 0 99.7124 4 0 99.6 6 0 99.6 2 0 99.7 1 0 99.7125 3 0 99.6 3 0 99.6 2 0 99.7 5 0 99.7126 2 0 99.6 1 0 99.6 1 0 99.7 1 0 99.7127 3 0 99.7 1 0 99.6 1 0 99.7 5 0 99.8128 5 0 99.7 0 0 99.6 2 0 99.7 4 0 99.8129 1 0 99.7 2 0 99.6 4 0 99.7 3 0 99.8130 3 0 99.7 0 0 99.6 4 0 99.8 2 0 99.8131 0 0 99.7 2 0 99.6 0 0 99.8 0 0 99.8132 3 0 99.7 1 0 99.6 2 0 99.8 1 0 99.8133 3 0 99.8 0 0 99.6 1 0 99.8 3 0 99.8134 1 0 99.8 4 0 99.7 3 0 99.8 1 0 99.8135 1 0 99.8 0 0 99.7 2 0 99.8 0 0 99.8136 2 0 99.8 2 0 99.7 1 0 99.8 3 0 99.8137 0 0 99.8 2 0 99.7 0 0 99.8 1 0 99.8138 1 0 99.8 1 0 99.7 2 0 99.8 1 0 99.8139 1 0 99.8 2 0 99.7 1 0 99.8 1 0 99.9140 2 0 99.8 1 0 99.7 0 0 99.8 0 0 99.9141 2 0 99.8 4 0 99.7 0 0 99.8 0 0 99.9142 0 0 99.8 0 0 99.7 2 0 99.9 0 0 99.9143 2 0 99.9 1 0 99.8 0 0 99.9 1 0 99.9144 1 0 99.9 3 0 99.8 2 0 99.9 1 0 99.9145 0 0 99.9 2 0 99.8 2 0 99.9 2 0 99.9146 2 0 99 1 0 99.8 1 0 99.9 0 0 99.9147 1 0 99.9 0 0 99.8 0 0 99.9 0 0 99.9148 0 0 99.9 1 0 99.8 0 0 99.9 0 0 99.9149 0 0 99.9 1 0 99.8 0 0 99.9 1 0 99.9150 2 0 99.9 2 0 99.8 2 0 99.9 0 0 99.9151 0 0 99.9 
3 0 99.8 1 0 99.9 0 0 99.9152 0 0 99.9 1 0 99.8 2 0 99.9 2 0 99.9153 0 0 99.9 0 0 99.8 0 0 99.9 1 0 99.9154 0 0 99.9 1 0 99.8 1 0 99.9 1 0 99.9155 0 0 99.9 0 0 99.8 0 0 99.9 1 0 99.9156 0 0 99.9 0 0 99.8 0 0 99.9 1 0 99.9157 0 0 99.9 1 0 99.9 0 0 99.9 0 0 99.9158 0 0 99.9 0 0 99.9 0 0 99.9 0 0 99.9159 2 0 99.9 0 0 99.9 0 0 99.9 2 0 99.9
Microengine0 Microengine1 Microengine2 Microengine3 Microengine4 Microengine5
Table D-2. SRAM Latency (unlocked) Data (3)
cycles #ofSmpl % Cumulative #ofSmpl % Cumulative #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative160 0 0 99.9 1 0 99.9 0 0 99.9 0 0 99.9161 1 0 99.9 1 0 99.9 0 0 99.9 0 0 99.9162 2 0 99.9 0 0 99.9 0 0 99.9 0 0 99.9163 0 0 99.9 0 0 99.9 0 0 99.9 0 0 99.9164 1 0 99.9 2 0 99.9 2 0 99.9 1 0 99.9165 0 0 99.9 0 0 99.9 0 0 99.9 0 0 99.9166 0 0 99.9 1 0 99.9 1 0 99.9 0 0 99.9167 0 0 99.9 0 0 99.9 2 0 100 0 0 99.9168 0 0 99.9 0 0 99.9 0 0 100 0 0 99.9169 2 0 99.9 0 0 99.9 0 0 100 0 0 99.9170 0 0 99.9 0 0 99.9 0 0 100 0 0 99.9171 1 0 100 2 0 99.9 0 0 100 1 0 99.9172 0 0 100 0 0 99.9 0 0 100 1 0 100173 2 0 100 1 0 99.9 0 0 100 0 0 100174 0 0 100 1 0 99.9 1 0 100 0 0 100175 1 0 100 1 0 99.9 0 0 100 0 0 100176 0 0 100 0 0 99.9 0 0 100 0 0 100177 0 0 100 1 0 99.9 0 0 100 0 0 100178 0 0 100 2 0 99.9 0 0 100 0 0 100179 0 0 100 2 0 99.9 0 0 100 1 0 100180 0 0 100 0 0 99.9 1 0 100 2 0 100181 0 0 100 0 0 99.9 0 0 100 1 0 100182 0 0 100 0 0 99.9 0 0 100 0 0 100183 0 0 100 1 0 100 0 0 100 0 0 100184 1 0 100 0 0 100 0 0 100 0 0 100185 0 0 100 1 0 100 1 0 100 1 0 100186 1 0 100 0 0 100 0 0 100 0 0 100187 0 0 100 0 0 100 0 0 100 0 0 100188 0 0 100 0 0 100 0 0 100 0 0 100189 0 0 100 1 0 100 0 0 100 0 0 100190 0 0 100 1 0 100 0 0 100 0 0 100191 0 0 100 0 0 100 0 0 100 2 0 100192 0 0 100 2 0 100 0 0 100 0 0 100193 0 0 100 0 0 100 0 0 100 0 0 100194 0 0 100 0 0 100 1 0 100 0 0 100195 0 0 100 0 0 100 0 0 100 0 0 100196 0 0 100 0 0 100 0 0 100 0 0 100197 0 0 100 0 0 100 0 0 100 1 0 100198 0 0 100 1 0 100 0 0 100199 2 0 100 0 0 100 0 0 100200 0 0 100 0 0 100201 0 0 100 0 0 100202 1 0 100 0 0 100203 0 0 100 0 0 100204 1 0 100 0 0 100205 0 0 100206 1 0 100207 0 0 100208 0 0 100209 0 0 100210 0 0 100211 0 0 100212 0 0 100213 0 0 100214 0 0 100215 0 0 100216 0 0 100217 0 0 100218 0 0 100219 0 0 100220 0 0 100221 0 0 100222 0 0 100223 0 0 100224 0 0 100225 0 0 100226 0 0 100227 0 0 100228 0 0 100229 0 0 100230 0 0 100231 1 0 100
232 0 0 100
233 1 0 100
Microengine0 Microengine1 Microengine2 Microengine3 Microengine4 Microengine5
Table D-3. SRAM Latency (locked) Data (1)
cycles #ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative#ofSmpl % Cumulative20 222 10.2 10.2 229 9.2 9.2 241 9.7 9.7 228 9.1 9.1 1087 21.8 21.8 1024 21.9 21.921 387 17.7 27.9 386 15.5 24.6 385 15.4 25.1 408 16.3 25.5 1237 248 46.6 1077 23 4522 89 4.1 32 92 3.7 28.3 91 3.6 28.7 111 4.4 29.9 353 7.1 53.7 288 6.2 51.123 132 6 38 154 6.2 34.5 150 6 34.7 140 5.6 35.5 286 5.7 59.4 277 5.9 5724 93 4.3 42.3 111 4.4 38.9 113 4.5 39.2 103 4.1 39.7 388 7.8 67.2 369 7.9 64.925 136 6.2 48.5 157 6.3 45.2 164 6.6 45.8 140 5.6 45.3 425 8.5 75.7 410 8.8 73.726 68 3.1 51.6 86 3.4 48.7 67 2.7 48.5 86 3.4 48.7 274 5.5 81.2 252 5.4 79.127 98 4.5 56.1 101 4 52.7 121 4.8 53.3 110 4.4 53.1 219 4.4 85.6 237 5.1 84.228 48 2.2 58.3 61 2.4 55.1 69 2.8 56.1 61 2.4 55.6 162 3.2 88.8 153 3.3 87.429 82 3.8 62 101 4 59.2 90 3.6 59.7 90 3.6 59.2 10 2 90.8 121 2.6 9030 42 1.9 64 47 1.9 61.1 51 2 61.8 39 1.6 60.7 77 1.5 92.4 91 1.9 9231 58 2.7 66.6 82 3.3 64.4 77 3.1 64.8 99 4 64.7 54 1.1 93.4 53 1.1 93.132 39 1.8 68.4 41 1.6 66 48 1.9 66.8 43 1.7 66.4 34 0.7 94.1 31 0.7 93.833 48 2.2 70.6 56 2.2 68.2 68 2.7 69.5 74 3 69.4 34 0.7 94.8 32 0.7 94.534 33 1.5 72.1 33 1.3 69.6 41 1.6 71.1 25 1 70.4 17 0.3 95.1 27 0.6 9535 48 2.2 74.3 43 1.7 71.3 53 2.1 73.2 59 2.4 72.8 21 0.4 95.6 21 0.4 95.536 27 1.2 75.5 23 0.9 72.21 33 1.3 74.6 28 1.1 73.9 15 0.3 95.9 14 0.3 95.837 42 1.9 77.5 49 2 74.2 35 1.4 76 49 2 75.8 12 0.2 96.1 10 0.2 9638 22 1 78.5 21 0.8 75 32 1.3 77.3 17 0.7 76.5 5 0.1 96.2 19 0.4 96.439 26 1.2 79.7 33 1.3 76.3 45 1.8 79.1 23 0.9 77.4 11 0.2 96.4 13 0.3 96.740 20 0.9 80.6 20 0.8 77.1 15 0.6 79.7 20 0.8 78.2 15 0.3 96.7 4 0.1 96.841 24 1.1 81.7 36 1.4 78.6 30 1.2 80.9 31 1.2 79.5 4 0.1 96.8 10 0.2 9742 19 0.9 82.6 24 1 79.5 27 1.1 81.9 12 0.5 80 9 0.2 97 5 0.1 97.143 16 0.7 83.3 22 0.9 80.4 26 1 83 30 1.2 81.2 7 0.1 97.1 5 0.1 97.244 12 0.5 83.8 21 0.8 81.3 13 0.5 83.5 19 0.8 81.9 5 0.1 97.2 4 0.1 97.345 17 0.8 84.6 27 1.1 82.3 15 0.6 84.1 22 
0.9 82.8 8 0.2 97.4 3 0.1 97.346 9 0.4 85 12 0.5 82.8 16 0.6 84.7 19 0.8 83.6 1 0 97.4 5 0.1 97.547 25 1.1 86.2 28 1.1 83.9 26 1 85.8 17 0.7 84.3 5 0.1 97.5 4 0.1 97.548 15 0.7 86.9 15 0.6 84.5 6 0.2 86 21 0.8 85.1 5 0.1 97.6 1 0 97.649 25 1.1 88 24 1 85.5 12 0.5 86.5 19 0.8 85.9 3 0.1 97.7 2 0 97.650 7 0.3 88.3 6 0.2 85.7 15 0.6 87.1 16 0.6 86.5 2 0 97.7 3 0.1 97.751 10 0.5 88.8 10 0.4 86.1 12 0.5 87.6 20 0.8 87.3 4 0.1 97.8 1 0 97.752 7 0.3 89.1 18 0.7 86.9 6 0.2 87.8 6 0.2 87.5 2 0 97.8 4 0.1 97.853 7 0.3 89.4 10 0.4 87.3 11 0.4 88.3 12 0.5 88 2 0 97.9 6 0.1 97.954 7 0.3 89.7 4 0.2 87.4 7 0.3 88.5 4 0.2 88.2 2 0 97.9 3 0.1 9855 7 0.3 90.1 10 0.4 87.8 10 0.4 88.9 21 0.8 89 2 0 98 3 0.1 9856 11 0.5 90.6 9 0.4 88.2 6 0.2 89.2 8 0.3 89.3 3 0.1 98 5 0.1 98.157 5 0.2 90.8 13 0.5 88.7 10 0.4 89.6 7 0.3 89.6 2 0 98.1 5 0.1 98.258 7 0.3 91.1 8 0.23 89 4 0.2 89.7 3 0.1 89.7 2 0 98.1 0 0 98.259 6 0.3 91.4 18 0.7 89.7 16 0.6 90.4 11 0.4 90.2 2 0 98.1 2 0 98.360 5 0.2 91.6 2 0.1 89.8 7 0.3 90.7 4 0.2 90.3 3 0.1 98.2 1 0 98.361 7 0.3 91.9 10 0.4 90.2 10 0.4 91.1 7 0.3 90.6 3 0.1 98.3 2 0 98.462 2 0.1 92 6 0.2 90.5 4 0.2 91.2 7 0.3 90.9 0 0 98.3 0 0 98.463 4 0.2 92.2 10 0.4 90.9 3 1 91.3 10 0.4 91.3 3 0.1 98.3 3 0.1 98.464 3 0.1 92.4 6 0.2 91.1 4 0.2 91.5 7 0.3 91.6 2 0 98.4 3 0.1 98.565 6 0.3 92.6 5 0.2 91.3 4 0.2 91.7 6 0.2 91.8 1 0 98.4 3 0.1 98.566 3 0.1 92.8 4 0.2 91.5 1 0 91.7 5 0.2 92 1 0 98.4 3 0.1 98.667 1 0 92.8 6 0.2 91.7 11 0.4 92.2 6 0.2 92.3 6 0.1 98.5 1 0 98.668 3 0.1 92.9 5 0.2 91.9 2 0.1 92.2 2 0.1 92.3 1 0 98.5 3 0.1 98.769 1 0 93 9 0.4 92.3 6 0.2 92.5 7 0.3 92.6 2 0 98.6 2 0 98.770 6 0.3 93.3 1 0 92.3 7 0.3 92.8 4 0.2 92.8 1 0 98.6 1 0 98.871 5 0.2 93.5 5 0.2 92.5 4 0.2 92.9 6 0.2 93 4 0.1 98.7 3 0.1 98.872 1 0 93.5 1 0 92.6 4 0.2 93.1 3 0.1 93.1 5 0.1 98.8 5 0.1 98.973 9 0.4 94 4 0.2 92.7 7 0.3 93.4 6 0.2 93.4 2 0 98.8 1 0 9974 4 0.2 94.1 1 0 92.8 3 0.1 93.5 6 0.2 93.6 3 0.1 98.9 0 0 9975 2 0.1 94.2 9 0.4 93.1 6 0.2 93.7 7 0.3 93.9 1 0 98.9 1 0 9976 5 0.2 
94.5 2 0.1 93.2 2 0.1 93.8 1 0 94 3 0.1 99 1 0 9977 3 0.1 94.6 5 0.2 93.4 2 0.1 93.9 3 0.1 94.1 1 0 99 2 0 9978 2 0.1 94.7 1 0 93.4 2 0.1 94 1 0 94.1 1 0 99 0 0 9979 2 0.1 94.8 4 0.2 93.6 4 0.2 94.1 5 0.2 94.3 5 0.1 99.1 3 0.1 99.180 1 0 94.8 4 0.2 93.8 1 0 94.2 2 0.1 94.4 1 0 99.1 0 0 99.181 2 0.1 94.9 6 0.2 94 1 0 94.2 3 0.1 94.5 2 0 99.2 1 0 99.182 3 0.1 95.1 4 0.2 94.2 3 0.1 94.3 1 0 94.6 0 0 99.2 3 0.1 99.283 1 0 95.1 2 0.1 94.2 3 0.1 94.4 2 0.1 94.6 2 0 99.2 3 0.1 99.384 1 0 95.1 3 0.1 94.4 1 0 94.5 0 0 94.6 2 0 99.2 1 0 99.385 4 0.2 95.3 3 0.1 94.5 3 0.1 94.6 3 0.1 94.8 3 0.1 99.3 1 0 99.386 0 0 95.3 0 0.1 2 0.1 94.7 3 0.1 94.9 1 0 99.3 1 0 99.387 5 0.2 95.6 2 0.1 94.6 2 0.1 94.8 4 0.2 95 0 0 99.3 2 0 99.4
Microengine0 Microengine1 Microengine2 Microengine3 Microengine4 Microengine5
Table D-3. SRAM Latency (locked) Data (2)
cycles #ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative#ofSmpl % Cumulative #ofSmpl % Cumulative88 2 0.1 95.7 2 0.1 94.6 4 0.2 94.9 2 0.1 95.1 1 0 99.3 0 0 99.489 2 0.1 95.7 2 0.1 94.7 1 0 95 9 0.4 95.5 2 0 99.4 2 0 99.490 1 0 95.8 5 0.2 94.9 2 0.1 95 5 0.2 95.7 2 0 99.4 0 0 99.491 2 0.1 95.9 2 0.1 95 1 0 95.1 2 0.1 95.8 5 0.1 99.5 0 0 99.492 1 0 95.9 1 0 95 2 0.1 95.2 3 0.1 95.9 1 0 99.5 2 0 99.493 3 0.1 96.1 2 0.1 95.1 4 0.2 95.3 2 0.1 96 1 0 99.6 3 0.1 99.594 2 0.1 96.2 1 0 95.2 1 0 95.4 1 0 96 1 0 99.6 2 0 99.695 5 0.2 96.4 3 0.1 95.3 1 0 95.4 4 0.2 96.2 1 0 99.6 0 0 99.696 1 0 96.4 2 0.1 95.4 3 0.1 95.5 2 0.1 96.2 1 0 99.6 2 0 99.697 2 0.1 96.5 3 0.1 95.5 3 0.1 95.6 2 0.1 96.3 1 0 99.6 0 0 99.698 0 0 96.5 2 0.1 95.6 2 0.1 95.7 3 0.1 96.4 0 0 99.6 1 0 99.699 1 0 96.6 2 0.1 95.6 1 0 95.8 2 0.1 96.5 2 0 99.7 0 0 99.6
100 0 0 96.6 2 0.1 95.7 1 0 95.8 0 0 96.5 3 0.1 99.7 1 0 99.6101 1 0 96.6 2 0.1 95.8 1 0 95.8 1 0 96.6 1 0 99.8 0 0 99.6102 0 0 96.6 2 0.1 95.9 3 0.1 96 1 0 96.6 0 0 99.8 2 0 99.7103 4 0.2 96.8 4 0.2 96 3 0.1 96.1 4 0.2 96.8 1 0 99.8 1 0 99.7104 1 0 96.8 0 0 96 1 0 96.1 2 0.1 96.8 0 0 99.8 0 0 99.7105 5 0.2 97.1 1 0 96.1 2 0.1 96.2 4 0.2 97 0 0 99.8 0 0 99.7106 0 0 97.1 0 0 96.1 3 0.1 96.3 1 0 97 0 0 99.8 2 0 99.7107 0 0 97.1 2 0.1 96.2 3 0.1 96.4 1 0 97.1 0 0 99.8 1 0 99.8108 1 0 97.1 1 0 96.2 3 0.1 96.6 3 0.1 97.2 1 0 99.8 0 0 99.8109 3 0.1 97.3 4 0.2 96.4 2 0.1 96.6 3 0.1 97.3 0 0 99.8 1 0 99.8110 2 0.1 97.3 1 0 96.4 1 0 96.7 2 0.1 97.4 0 0 99.8 0 0 99.8111 1 0 97.4 1 0 96.4 2 0.1 96.8 3 0.1 97.5 0 0 99.8 0 0 99.8112 0 0 97.4 3 0.1 96.6 1 0 96.8 3 0.1 97.6 1 0 99.8 0 0 99.8113 1 0 97.4 2 0.1 96.6 5 0.2 97 0 0 97.6 2 0 99.9 2 0 99.8114 0 0 97.4 1 0 96.7 1 0 97 3 0.1 97.8 1 0 99.9 1 0 99.9115 0 0 97.4 1 0 96.7 2 0.1 97.1 1 0 97.8 0 0 99.9 1 0 99.9116 0 0 97.4 1 0 96.8 1 0 97.2 0 0 97.8 0 0 99.9 0 0 99.9117 3 0.1 97.6 1 0 96.8 2 0.1 97.2 0 0 97.8 0 0 99.9 1 0 99.9118 0 0 97.6 0 0 96.8 2 0.1 97.3 3 0.1 97.9 0 0 99.9 0 0 99.9119 1 0 97.6 3 0.1 96.9 0 0 97.3 1 0 98 1 0 99.9 0 0 99.9120 2 0.1 97.7 1 0 97 0 0 97.3 0 0 98 1 0 99.9 0 0 99.9121 3 0.1 97.8 1 0 97 4 0.2 97.5 2 0.1 98 1 0 99.9 0 0 99.9122 1 0 97.9 1 0 97 0 0 97.5 3 0.1 98.2 0 0 99.9 1 0 99.9123 1 0 97.9 0 0 97 2 0.1 97.6 3 0.1 98.3 1 0 100 0 0 99.9124 0 0 97.9 1 0 97.1 1 0 97.6 0 0 98.3 0 0 100 0 0 99.9125 0 0 97.9 2 0.1 97.2 3 0.1 97.7 1 0 98.3 0 0 100 0 0 99.9126 1 0 98 3 0.1 97.3 3 0.1 97.8 1 0 98.4 1 0 100 0 0 99.9127 3 0.1 98.1 3 0.1 97.4 0 0 97.8 2 0.1 98.4 0 0 100 0 0 99.9128 1 0 98.2 1 0 97.4 1 0 97.9 2 0.1 98.5 0 0 100 0 0 99.9129 0 0 98.2 0 0 97.4 3 0.1 98 1 0 98.6 0 0 100 0 0 99.9130 0 0 98.2 1 0 97.5 2 0.1 98.1 1 0 98.6 0 0 100 0 0 99.9131 1 0 98.2 0 0 97.5 1 0 98.1 0 0 98.6 0 0 100 0 0 99.9132 1 0 98.3 0 0 97.5 0 0 98.1 1 0 98.6 0 0 100 0 0 99.9133 2 0.1 98.4 0 0 97.5 1 0.1 98.2 2 0.1 98.7 0 0 
100 0 0 99.9134 0 0 98.4 3 0.1 97.6 0 0 98.2 2 0.1 98.8 0 0 100 0 0 99.9135 2 0.1 98.4 0 0 97.6 0 0 98.2 2 0.1 98.9 0 0 100 0 0 99.9136 0 0 98.4 1 0 97.6 2 0.1 98.2 2 0.1 99 0 0 100 1 0 99.9137 2 0.1 98.5 3 0.1 97.8 3 0.1 98.4 2 0.1 99 0 0 100 0 0 99.9138 1 0 98.6 2 0.1 97.8 2 0 98.4 2 0.1 99.1 0 0 100 0 0 99.9139 0 0 98.6 0 0 97.8 1 0 98.5 3 0.1 99.2 0 0 100 0 0 99.9140 0 0 98.6 1 0 97.9 1 0.1 98.5 0 0 99.2 0 0 100 0 0 99.9141 1 0 98.6 1 0 97.9 2 0.1 98.6 1 0 99.3 0 0 100 0 0 99.9142 0 0 98.6 2 0.1 98 0 0 98.6 0 0 99.3 0 0 100 0 0 99.9143 1 0 98.7 1 0 98 3 0.1 98.7 0 0 99.3 0 0 100 0 0 99.9144 1 0 98.7 0 0 98 0 0 98.7 0 0 99.3 0 0 100 0 0 99.9145 1 0 98.8 3 0.1 98.2 2 0 98.8 2 0.1 99.4 0 0 100 0 0 99.9146 1 0 98.8 2 0.1 98.2 1 0 98.8 0 0 99.4 0 0 100 0 0 99.9147 0 0 98.8 1 0 98.3 1 0 98.9 2 0.1 99.4 0 0 100 0 0 99.9148 0 0 98.8 1 0 98.3 0 0 98.9 1 0 99.5 0 0 100 0 0 99.9149 0 0 98.8 0 0 98.3 1 0 98.9 0 0 99.5 0 0 100 0 0 99.9150 1 0 98.9 3 0.1 98.4 0 0 98.9 1 0 99.5 0 0 100 0 0 99.9151 1 0 98.9 3 0.1 98.6 1 0 99 0 0 99.5 0 0 100 0 0 99.9152 0 0 98.9 2 0.1 98.6 1 0 99 0 0 99.5 0 0 100 0 0 99.9153 0 0 98.9 0 0 98.6 1 0 99 1 0 99.6 0 0 100 0 0 99.9154 0 0 98.9 0 0 98.6 0 0 99 1 0 99.6 0 0 100 1 0 100155 2 0.1 99 1 0 98.7 1 0 99.1 0 0 99.6 0 0 100 0 0 100156 1 0 99 2 0.1 98.8 1 0 99.1 0 0 99.6 0 0 100 0 0 100157 0 0 99 2 0.1 98.8 0 0 99.1 0 0 99.6 0 0 100 0 0 100158 0 0 99 0 0 98.8 0 0 99.1 1 0 99.6 0 0 100 1 0 100159 1 0 99.1 1 0 98.9 1 0 99.2 0 0 99.6 0 0 100 0 0 100
Microengine0 Microengine1 Microengine2 Microengine3 Microengine4 Microengine5
Table D-3. SRAM Latency (locked) Data (3)
(Continuation of the SRAM latency (locked) histogram for Microengine0 through Microengine5: for each cycle count, the number of samples, the percentage, and the cumulative percentage.)
Table D-3. SRAM Latency (locked) Data (4)
(Tail of the SRAM latency (locked) histogram for Microengine0 through Microengine5: samples, percentage, and cumulative percentage per cycle count.)
Table D-4. Receive FIFO buffer Latency Data (1)
(Receive FIFO buffer latency histogram for Microengine0 through Microengine3: for each cycle count, the number of samples, the percentage, and the cumulative percentage.)
Table D-4. Receive FIFO buffer Latency Data (2)
(Continuation of the Receive FIFO buffer latency histogram for Microengine0 through Microengine3.)
Table D-5. Scratchpad RAM Latency Data (1)
(Scratchpad RAM latency histogram for Microengine4 and Microengine5: for each cycle count, the number of samples, the percentage, and the cumulative percentage.)
Table D-5. Scratchpad RAM Latency Data (2)
(Continuation of the Scratchpad RAM latency histogram for Microengine4 and Microengine5.)
Table D-6. FBI CSR Latency Data (1)
(FBI CSR latency histogram for Microengine0 through Microengine5: for each cycle count, the number of samples, the percentage, and the cumulative percentage.)
Table D-6. FBI CSR Latency Data (2)
(Continuation of the FBI CSR latency histogram for Microengine0 through Microengine5.)
Table D-7. Hash Latency Data
(Hash unit latency histogram for Microengine0 through Microengine3: for each cycle count, the number of samples, the percentage, and the cumulative percentage.)
Appendix E: Multithreading Example
Figure E-1. Multithreading example
Appendix F: Theoretical Throughput Calculation for IP
Packets
Table F-1. Theoretical Throughput of IP Packets
Media                64-byte PPS       594-byte PPS       1518-byte PPS       Mixture (avg 406-byte) PPS
                     (46-byte IP pkt)  (576-byte IP pkt)  (1500-byte IP pkt)  (avg 388-byte IP packet)
100Mbps Ethernet     148,810           20,358             8,127               29,343
Gigabit Ethernet     1,488,095         203,583            81,274              293,427
10Gigabit Ethernet   14,880,952        2,035,831          812,744             2,934,272
OC-3 POS CRC-16      348,491           31,681             12,256              46,759
OC-12 POS CRC-16     1,412,830         128,439            49,688              189,570
OC-24 POS CRC-16     2,825,660         256,878            99,376              379,139
OC-48 POS CRC-16     5,651,321         513,756            198,752             758,278
OC-192 POS CRC-16    22,605,283        2,055,026          795,010             3,033,114
OC-3 POS CRC-32      335,818           31,573             12,240              46,524
OC-12 POS CRC-32     1,361,455         128,000            49,622              188,615
OC-24 POS CRC-32     2,722,909         256,000            99,245              377,229
OC-48 POS CRC-32     5,445,818         512,000            198,489             754,458
OC-192 POS CRC-32    21,783,273        2,048,000          793,956             3,017,834
ATM OC-3             174,245           26,807             10,890              38,721
ATM OC-12            706,415           108,679            44,151              156,981
ATM OC-24            1,412,830         217,358            88,302              313,962
ATM OC-48            2,825,660         434,717            176,604             627,925
ATM OC-192           11,302,642        1,738,868          706,415             2,511,698
Note: These throughput numbers are for IP traffic, to allow comparison across different encapsulations. The POS (Packet over SONET) performance calculations use CRC-16 and CRC-32.
I present the methodology for calculating PPS for each medium below.
IP over Ethernet
As described in Section 5.6, there are 38 bytes of protocol overhead per IP packet. The maximum theoretical throughput on Ethernet is calculated as follows.

Maximum Packets Per Second (PPS) = Ethernet Data Rate (bps) / {(18-byte Ethernet header and trailer + IP packet size + 12-byte IFG + 8-byte Preamble/SFD) x 8}
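The Ethernet formula above can be sketched in a few lines (a sketch; the function name is my own):

```python
def ethernet_max_pps(data_rate_bps, ip_packet_size):
    """Maximum theoretical packets per second for IP over Ethernet.

    Per-packet overhead: 18-byte Ethernet header/trailer, 12-byte
    inter-frame gap (IFG), and 8-byte preamble/SFD, i.e. 38 bytes.
    """
    wire_bytes = 18 + ip_packet_size + 12 + 8
    return data_rate_bps / (wire_bytes * 8)

# 46-byte IP packets on 100 Mbps Ethernet, as in Table F-1
print(round(ethernet_max_pps(100_000_000, 46)))  # 148810
```

For the minimum 46-byte IP packet this reproduces the 148,810 PPS figure for 100 Mbps Ethernet in Table F-1.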
IP over SONET

First, we have to consider the gross data rate, including media and protocol overhead. An OC-1 (STS-1) frame consists of 9 rows by 90 columns of 8-bit bytes (9 x 90 x 8 = 6480 bits/frame). Frames are sent at a rate of 8000 frames per second (125-microsecond frame length). Therefore, the gross data rate (i.e., total bandwidth) of an OC-1 frame is 6480 bits x 8000 frames/sec = 51.84 Mbps. Next, an OC-3 (or STM-1) frame consists of 9 rows by 270 columns of 8-bit bytes (9 x 270 x 8 = 19440 bits/frame). The gross data rate is 19440 x 8000 frames/sec = 155.52 Mbps. An OC-12 (or STM-4) frame consists of 9 rows by 1080 columns of 8-bit bytes (9 x 1080 x 8 = 77760 bits/frame). The gross data rate therefore is 77760 x 8000 frames/sec = 622.080 Mbps. An OC-24 (or STM-8) frame consists of 9 rows by 2160 columns of 8-bit bytes (9 x 2160 x 8 = 155520 bits/frame). The gross data rate therefore is 155520 x 8000 frames/sec = 1244.160 Mbps. An OC-48 (or STM-16) frame consists of 9 rows by 4320 columns of 8-bit bytes (9 x 4320 x 8 = 311040 bits/frame). Hence, the gross data rate is 311040 x 8000 frames/sec = 2488.320 Mbps. Finally, an OC-192 (or STM-64) frame consists of 9 rows by 17280 columns of 8-bit bytes (9 x 17280 x 8 = 1244160 bits/frame). The gross data rate is 1244160 x 8000 frames/sec = 9953.280 Mbps.
Second, we calculate the SONET data rate, which excludes the media overhead. In OC-1, the first 3 columns contain the transport overhead, which includes the section overhead and the line overhead. The remaining 87 columns are called the synchronous payload envelope (SPE), which contains the path overhead and payload. Path overhead is 1 column by 9 rows, leaving 86 columns for payload. As a result, the SONET data rate is given by the following equation.

SONET Data Rate = (90 col - 3 transport overhead col - 1 path overhead col) x 9 rows x 8 bits/byte x 8000 fps = 49.536 Mbps
Similarly, the SONET data rate for the other classes can be calculated as follows. Table F-2 summarizes the gross data rate and the SONET data rate.

Payload bps = (N x (90 col - 3 transport col) - 1 path col) x 9 rows x 8 bits/byte x 8000 fps, where N is taken from OC-N
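The gross-rate and payload-rate equations above can be sketched as follows (a sketch; the function names are my own, and the payload formula deducts a single path-overhead column exactly as written above):

```python
def sonet_gross_rate_mbps(n):
    """Gross data rate of OC-N: N x 9 rows x 90 columns x 8 bits, 8000 frames/sec."""
    return n * 9 * 90 * 8 * 8000 / 1e6

def sonet_payload_rate_mbps(n):
    """SONET data rate of OC-N: N x (90 - 3 transport) columns, minus one
    path-overhead column, x 9 rows x 8 bits/byte x 8000 frames/sec."""
    payload_cols = n * (90 - 3) - 1
    return payload_cols * 9 * 8 * 8000 / 1e6

print(sonet_gross_rate_mbps(1))    # 51.84
print(sonet_payload_rate_mbps(1))  # 49.536
```

For OC-1 this reproduces the 51.84 Mbps and 49.536 Mbps figures derived in the text; for N > 1, Table F-2 reflects the thesis's own overhead assumptions.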
Table F-2. SONET and SDH Multiplex Rates

Optical   SDH STM   Gross Data    SONET Data Rate (Mbps)
Carrier   Signal    Rate (Mbps)   (takes both the 3-column transport and 1-column path overhead in the SPE into account)
OC-1      -         51.84         49.536
OC-3      STM-1     155.52        147.76
OC-12     STM-4     622.08        599.04
OC-24     STM-8     1244.16       1198.08
OC-48     STM-16    2488.32       2396.16
OC-192    STM-64    9953.28       9584.64
In addition, the protocol overhead encapsulating the data should be taken into account. In Table F-1, the performance is calculated based on CRC-16 and CRC-32. Hence, the Packet over SONET (POS) maximum PPS is calculated as follows.

CRC-16 header = 7 bytes = 1-byte delimiter + 4-byte HDLC + 2-byte CRC-16
CRC-32 header = 9 bytes = 1-byte delimiter + 4-byte HDLC + 4-byte CRC-32
POS max PPS = OC-N SONET Data Rate / {(IP packet size + CRC header bytes) x 8}
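The POS calculation can be sketched as follows (a sketch; the function name is my own, and the SONET data rate is taken from Table F-2):

```python
def pos_max_pps(sonet_rate_mbps, ip_packet_size, crc_header_bytes):
    """Maximum PPS for Packet over SONET.

    crc_header_bytes is 7 for CRC-16 (1-byte delimiter + 4-byte HDLC
    + 2-byte CRC) or 9 for CRC-32 (4-byte CRC).
    """
    frame_bits = (ip_packet_size + crc_header_bytes) * 8
    return sonet_rate_mbps * 1e6 / frame_bits

# 46-byte IP packets over OC-3 (147.76 Mbps SONET data rate)
print(round(pos_max_pps(147.76, 46, 7)))  # 348491
```

For 46-byte IP packets over OC-3 this reproduces the 348,491 PPS (CRC-16) and 335,818 PPS (CRC-32) figures in Table F-1.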
IP over ATM

The throughput of IP packet traffic over ATM is calculated as follows.

AAL5 PDU size = IP packet size + 8-byte SNAP + 4-byte AAL5 overhead + 4-byte CRC = IP packet size + 16 bytes
ATM cell count = roundup(AAL5 PDU size / 48-byte cell payload)
Total cell bytes = ATM cell count x 53-byte cell size
ATM max PPS = OC-N SONET Data Rate / (Total cell bytes x 8)

Note: AAL: ATM Adaptation Layer, PDU: Protocol Data Unit, SNAP: Subnetwork Access Protocol
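The ATM steps above can be sketched as follows (a sketch; the function name is my own):

```python
import math

def atm_max_pps(sonet_rate_mbps, ip_packet_size):
    """Maximum PPS for IP over ATM (AAL5 with SNAP encapsulation)."""
    # AAL5 PDU = IP packet + 8-byte SNAP + 4-byte AAL5 overhead + 4-byte CRC
    pdu_size = ip_packet_size + 16
    cell_count = math.ceil(pdu_size / 48)   # 48-byte payload per ATM cell
    total_cell_bytes = cell_count * 53      # 53-byte cells on the wire
    return sonet_rate_mbps * 1e6 / (total_cell_bytes * 8)

# 46-byte IP packets over ATM OC-3 (147.76 Mbps SONET data rate)
print(round(atm_max_pps(147.76, 46)))  # 174245
```

A 46-byte IP packet yields a 62-byte PDU, which needs 2 cells (106 bytes on the wire), reproducing the 174,245 PPS figure for ATM OC-3 in Table F-1.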
Appendix G. Instruction Set of Other NPs
Table G-1. MIPS-1 Instruction Set (Integer only)
Instruction                 Description

Arithmetic and Logical Instructions
add dest, src1, src2        Add
addi dest, src1, imm        Add immediate
addu dest, src1, src2       Add unsigned
addiu dest, src1, imm       Add immediate unsigned
sub dest, src1, src2        Subtract
subu dest, src1, src2       Subtract unsigned
and dest, src1, src2        And
andi dest, src1, imm        And immediate
div src1, src2              Divide
divu src1, src2             Divide unsigned
mult src1, src2             Multiply
multu src1, src2            Multiply unsigned
or dest, src1, src2         Or
ori dest, src1, imm         Or immediate
nor dest, src1, src2        Nor
sll dest, src1, src2        Shift left logical
sllv dest, src1, src2       Shift left logical variable
sra dest, src1, src2        Shift right arithmetic
srav dest, src1, src2       Shift right arithmetic variable
srl dest, src1, src2        Shift right logical
srlv dest, src1, src2       Shift right logical variable
xor dest, src1, src2        Xor
xori dest, src1, imm        Xor immediate

Branch and Jump Instructions
beq src1, src2, offset      Branch on equal
bne src1, src2, offset      Branch on not equal
bgez src, offset            Branch on greater than or equal to zero
bgezal src, offset          Branch on greater than or equal to zero and link
bgtz src, offset            Branch on greater than zero
blez src, offset            Branch on less than or equal to zero
bltz src, offset            Branch on less than zero
bltzal src, offset          Branch on less than zero and link
j label                     Jump
jal label                   Jump and link
jalr src                    Jump and link register
jr src                      Jump register
rfe                         Return from exception

Comparison Instructions
slt dest, src1, src2        Set less than
slti dest, src1, imm        Set less than immediate
sltu dest, src1, src2       Set less than unsigned
sltiu dest, src1, imm       Set less than immediate unsigned

Load and Store Instructions
lb dest, imm(src)           Load byte
lbu dest, imm(src)          Load unsigned byte
lh dest, imm(src)           Load halfword
lhu dest, imm(src)          Load unsigned halfword
lw dest, imm(src)           Load word
lwl dest, imm(src)          Load word left
lwr dest, imm(src)          Load word right
sb src1, imm(src2)          Store byte
sh src1, imm(src2)          Store halfword
sw src1, imm(src2)          Store word
swl src1, imm(src2)         Store word left
swr src1, imm(src2)         Store word right

Constant-Manipulating Instructions
lui dest, imm               Load upper immediate

Miscellaneous Instructions
mfhi dest                   Move from hi
mflo dest                   Move from lo
mthi src                    Move to hi
mtlo src                    Move to lo
mfcz dest                   Move from coprocessor z
Table G-2. PowerNP Picoprocessor Opcodes

Instruction                 Description

ALU Opcode / Arithmetic Immediate (AluOp)
add                         result = opr1 + opr2
add w/carry                 result = opr1 + opr2 + C
subtract                    result = opr1 - opr2
subtract w/carry            result = opr1 - opr2 - C
xor                         result = opr1 XOR opr2
and                         result = opr1 AND opr2
or                          result = opr1 OR opr2
shift left logical          result = opr1 << opr2, fill with 0s
shift right logical         result = fill with 0, opr1 >> opr2
shift right arithmetic      result = fill with S, opr1 >> opr2
rotate right                result = fill with opr1, opr1 >> opr2
compare                     opr1 - opr2
test                        opr1 AND opr2
not                         result = NOT(opr1)
transfer                    result = opr2

Logical Immediate Opcode (LOp)
xor                         result = opr1 XOR opr2
and                         result = opr1 AND opr2
or                          result = opr1 OR opr2
test                        opr1 AND opr2

Compare Immediate Opcode
Compare Immediate (1)       Compare odd GPR register with immediate data
Compare Immediate (2)       Compare even GPR register with immediate data
Compare Immediate (3)       Compare word GPR register with immediate data, zero extend
Compare Immediate (4)       Compare word GPR register with immediate data, sign extend

Load Immediate Opcode
Load immediate (1)          Load odd halfword GPR from immediate data
Load immediate (2)          Load even halfword GPR from immediate data
Load immediate (3)          Load word GPR from immediate data, zero extended
Load immediate (4)          Load word GPR from immediate data, 0 postpend
Load immediate (5)          Load word GPR from immediate data, 1 extended
Load immediate (6)          Load word GPR from immediate data, 1 postpend
Load immediate (7)          Load word GPR from immediate data, sign extended
Load immediate (8)          Load GPR byte 3 from low byte of immediate data
Load immediate (9)          Load GPR byte 2 from low byte of immediate data
Load immediate (10)         Load GPR byte 1 from low byte of immediate data
Load immediate (11)         Load GPR byte 0 from low byte of immediate data

Arithmetic/Logical Register Opcode
Bit clear                   uses and (AluOp)
Bit set                     uses or (AluOp)
Bit flip                    uses xor (AluOp)

Count Leading Zeros Opcode
Count Leading Zeros         returns the number of zeros from left to right until the first 1-bit is encountered

Control Opcodes
nop                         executes one cycle of time and does not change any state
exit                        terminates the current instruction stream.* The CLP will be put into an idle state and made available for a new dispatch
test and branch             tests a single bit within a GPR register
branch and link             performs a conditional branch*, adds one to the value of the current program counter, and places it onto the program stack
return                      performs a conditional branch* with the branch destination being the top of the program stack
branch register             performs a conditional branch*
branch pc relative          performs a conditional branch*
branch reg+off              performs a conditional branch*

Data Movement Opcodes
memory indirect             transfers data between a GPR and a coprocessor array via a logical address in which the base offset into the array is contained in a GPR
memory add indirect         transfers data between a GPR and a coprocessor data entity (scalar or array) by mapping the coprocessor via a logical address into the base address held in the GPR indicated by the opcode
memory direct               transfers data between a GPR and a coprocessor array via a logical address that is specified in the immediate portion of the opcode
scalar access               transfers data between a GPR and a scalar register via a logical address that consists of a coprocessor number and a scalar register
scalar immed                writes immediate data to a scalar register via a logical address that is completely specified in the immediate portion of the opcode
transfer quadword           transfers quadword data from one array location to another using one instruction
zero array                  zeroes out a portion of an array with one instruction

Coprocessor Execution Opcodes
execute direct              initiates a coprocessor command in which all of the operation arguments are passed immediately to the opcode
execute indirect            initiates a coprocessor command in which the operation arguments are a combination of a GPR register and an immediate field
execute direct conditional  similar to the execute direct opcode except that it can be issued conditionally based on the cond field
execute indirect conditional  similar to the execute indirect opcode except that it can be issued conditionally based on the cond field
wait                        synchronizes one or more coprocessors
wait and branch             synchronizes with one coprocessor and branches

Note: Conditional branch* depends on condition codes. Data Movement Opcodes support 23 options of direction, size, extension, and fill.

Table G-3. PowerNP Picoprocessor Condition Codes for conditional branch

0    equal or zero
1    not equal or not zero
2    carry set
3    unsigned higher
4    unsigned lower or equal
5    unsigned lower or equal
6    always
7    signed positive
8    signed negative
9    signed greater or equal
10   signed greater than
11   signed less than or equal
12   signed less than
13   overflow
14   no overflow
List of Figures
2-1. Internet Hierarchy 6
3-1. Router Processing on Fast Path 8
3-2. Pseudo Code of Receive Thread Main Loop 20
3-3. Pseudo Code of Transmit Scheduler Main Loop 23
3-4. Pseudo Code of Transmit Thread Main Loop 27
4-1. Architecture of the Intel IXP1200 31
4-2. Microengine Architecture 32
4-3. FBI Unit Architecture 35
4-4. Ready Bus and Ready Flags 36
4-5. Microengine Pipeline 37
4-6. Memory Access flow 39
4-7. Branch pipeline example with class3 instruction 41
4-8. Branch pipeline example with class2 instruction (case1) 42
4-9. Branch pipeline example with class2 instruction (case2) 43
4-10. Branch pipeline example with class1 instruction 43
4-11. Branch pipeline example with deferred branch instruction 44
4-12. Branch pipeline example with guess instruction 46
4-13. Branch pipeline example with guess and deferred branch options 46
5-1. Instruction Mix for Receiving Packets 50
5-2. Instruction Mix for Transmitting Packets 51
5-3. Instruction Mix for Overall Processing 52
5-4. SDRAM Latency 53
5-5. SRAM Latency (unlocked) 55
5-6. SRAM Latency (locked) 55
5-7. Executing, Aborted, Stalled, and Idle ratio on 64bytes Workload 57
5-8. Executing, Aborted, Stalled, and Idle ratio on 594bytes Workload 58
5-9. Executing, Aborted, Stalled, and Idle ratio on 1518bytes Workload 58
5-10. Executing, Aborted, Stalled, and Idle ratio on Mixture Workload 59
5-11. CPI for Microengines 60
5-12. Throughputs (bounded) 62
5-13. Throughputs (unbounded) 64
6-1. NetVortex Context Switch Mechanism 68
6-2. Coprocessor Execution Opcode Example (Wait Opcode) 71
A-1. Receive Ready Check 76
A-2. Receive Request Issue 77
A-3. Receive Packet Status Acquisition 78
A-4. Packet Buffer Allocation 79
A-5. Port Fail/Error Check 79
A-6. MAC Packet Header Acquisition 80
A-7. Parse Packet 81
A-8. Ethertype Field Classifier 82
A-9. Filter 82
A-10. Port information Acquisition for Filter 86
A-11. IP Header Acquisition 87
A-12. IP Version Check 87
A-13. IP Header Check & Modify 88
A-14. IP verify 89
A-15. IP Modify 90
A-16. Packet Discard 91
A-17. Trie Lookup 91
A-18. Next_Trie_Search for Trie Lookup 95
A-19. Write Modified IP and Ether Header 95
A-20. Transmit Assignment Read 96
A-21. Transmit Packet Link List Read 97
A-22. Transmit Packet Link List Update 97
A-23. Transmit Port Vector clear 98
A-24. Last Packet Transfer 98
A-25. Set Transmit Control Word 99
A-26. TFIFO Validate 99
A-27. Transmit Port Vector Modify 100
A-28. Transmit Packet Transfer 101
D-1. Receive FIFO buffer Latency 111
D-2. Scratchpad RAM Latency 111
D-3. FBI CSR Latency 112
D-4. Hash unit Latency 112
E-1. Multithreading example 131
List of Tables
3-1. Frequently occurring packets in the real Internet 12
3-2. Workloads of Fixed size packets 15
3-3. Workload of Internet Packets Mixture 15
4-1. Instructions Categorized by Class 40
4-2. Guess Branch Instructions 45
6-1. NetVortex extended Instruction set 67
6-2. C-5 Coprocessor Zero Register Definitions 70
B-1. Microengine Instruction Set 102
C-1. Instruction Mix Data for 64bytes packets 106
C-2. Instruction Mix Data for 594bytes packets 107
C-3. Instruction Mix Data for 1518bytes packets 108
C-4. Instruction Mix Data for Mixture packets 109
C-5. Memory Access per cycle 110
D-1. SDRAM Latency Data 113
D-2. SRAM Latency (unlocked) Data 117
D-3. SRAM Latency (locked) Data 120
D-4. Receive FIFO buffer Latency Data 124
D-5. Scratchpad RAM Latency Data 126
D-6. FBI CSR Latency Data 128
D-7. Hash Latency Data 130
F-1. Theoretical Throughput of IP Packets 132
F-2. SONET and SDH Multiplex Rates 135
G-1. MIPS-1 Instruction Set (Integer only) 137
G-2. PowerNP Picoprocessor Opcodes 138
G-3. PowerNP Picoprocessor Condition Codes for conditional branch 139