Optimizing PCIe® Performance in PCs & Embedded Systems
Mike Alford, Gennum
PCI-SIG Developers Conference 2009
Copyright © 2009, PCI-SIG, All Rights Reserved
Disclaimer
Presentation Disclaimer: All opinions, judgments, recommendations, etc. that are presented herein are the opinions of the presenter of the material and do not necessarily reflect the opinions of the PCI-SIG®.
Some information in this presentation refers to a specification still in the development process.
Agenda
Latency: link layer, packet layer, driver/SW, system level
DMA engine architecture: conventional vs. PCIe® optimized
Root Complex Characteristics: measured vs. theoretical
Avoiding hazards and race conditions
Interrupt controller design
Overview
The objective of this presentation is to explore some of the important elements of system performance with respect to endpoint HW and SW design:
How should they work and perform?
How do they perform in actual implementations?
Best practices in endpoint design (FPGA and ASIC)
Latency
There are many forms, or layers, of latency in a system. End to end, from the endpoint hardware up through application software:
Link level latency (PM etc.): 10's of ns
Packet level latency (endpoint DMA/PCIe core, switch, RC): 100's of ns
Driver level latency (interrupt service): 10's to 100's of µs
Application latency (application SW): > ms
Latency Impact of Power Management
PM Scenario 1: Negotiate down the number of lanes. A packet that was striped across 4 lanes must instead be serialized over a single lane, taking longer to transmit.
PM Scenario 2: Aggregate packets. Packets on a link are held back and transmitted together in bursts; the packets that wait incur additional latency.
Packet Latency: Pre-PCI
Prior to PCI, the cost of polling IO was about the same as polling memory: on the order of several hundred ns. This assumption was accepted in the architecture of IO devices and driver software.
(Topology: a 486-class CPU with RAM and IO both on a 32-bit/33 MHz local bus.)
Packet Latency: PCI
With PCI, system aggregate bandwidth improved while IO latency degraded: memory access has a lower latency cost than IO, especially if the IO sits below multiple layers of bridges.
Packet Latency: PCI Express®
PCI Express provides even more options for system expansion that can increase IO latency: switches, repeaters, and cabling can all sit between the root complex and the endpoint IO.
Switch latency: ~200 ns or more
PCIe cable delay: ~43 ns per 10 m
Polling Latency Across the System
Average read latency (ns) measured on 2 different PC systems:

           Cache Read   Memory Read   PCIe Endpoint Read
System 1   1.50         66.71         920.01
System 2   2.50         71.34         1540.77
Interrupt Latency Factors
More cache layers (L1/L2/L3): cache miss probability decreases, but the penalty for a miss increases
– Results in less predictable interrupt latency (larger max/min ratio)
– Will tend to get worse in the future
Deeper processor pipelines result in longer interrupt latencies, due to the larger amount of context information that must be flushed or stored
Interrupt Latency Experiment
Use a test endpoint card to generate an interrupt under SW control.
Repeatedly measure the time interval T between assertion of the interrupt by HW and its de-assertion from within the interrupt handler (ISR), and build a histogram of the results.
Vary system loading to observe the effect on interrupt latency.
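A minimal C sketch of the host side of such an experiment, assuming hypothetical MMIO registers (IRQ_FIRE, IRQ_CLEAR, IRQ_TIME) on the test endpoint; the interval T is captured by a free-running counter in the endpoint hardware, not by the host:

    #include <stdint.h>

    /* Hypothetical MMIO registers on the test endpoint (offsets into BAR0). */
    #define IRQ_FIRE   0x00  /* write 1: assert interrupt, start HW timer   */
    #define IRQ_CLEAR  0x04  /* write 1: de-assert interrupt, stop HW timer */
    #define IRQ_TIME   0x08  /* read: latched ticks between fire and clear  */

    static volatile uint32_t *bar0;  /* mapped by the driver, not shown */

    /* Interrupt handler: de-assert as early as possible so the latched
     * IRQ_TIME approximates raw interrupt service latency. */
    void test_isr(void)
    {
        bar0[IRQ_CLEAR / 4] = 1;
    }

    /* One measurement pass: fire, wait for the ISR to run, read T. */
    uint32_t measure_once(void)
    {
        bar0[IRQ_TIME / 4] = 0;          /* reset the latched interval */
        bar0[IRQ_FIRE / 4] = 1;
        while (bar0[IRQ_TIME / 4] == 0)
            ;                            /* HW latches a nonzero count */
        return bar0[IRQ_TIME / 4];       /* bin into the histogram     */
    }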
Interrupt Latency Example
Histogram of measured interrupt latency (occurrence count on a log axis vs. latency from 10 µs to 1000 µs), measured on a PC system running Windows XP under three load conditions:
Latency (system idle): Max = 33 µs
Latency (20% CPU load): Max = 23.4 ms
Latency (>90% CPU load): Max = 18.2 ms
Note: "Max" values are the largest latencies measured under those conditions.
Simple DMA Service Scenario
The host sets up a DMA transfer, the peripheral executes it, and the cycle repeats. Between completing one transfer and the setup of the next there is a service latency; if this latency is too long, the peripheral will starve (examples: dropped video frames, audio breakup).
Service Latency vs. Buffer Size
Required buffer size (KB, 0 to 500) vs. throughput (MB/s, 0 to 500), plotted for worst-case service latencies of 25, 50, 100, 250, 375, 500, 750, and 1000 µs.
Example: 1 stream of 1080p60 video.
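The curves follow from buffer_size ≥ throughput × worst-case service latency. As a worked example (assuming, hypothetically, uncompressed 8-bit 4:2:2 video at 2 bytes per pixel, since the slide does not state the stream format): 1080p60 ≈ 1920 × 1080 × 2 B × 60/s ≈ 249 MB/s, so tolerating a 1000 µs worst-case service latency requires roughly 249 MB/s × 1 ms ≈ 250 KB of buffering in the endpoint.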
DMA Engine Architecture
With PCIe, DMA can take advantage of the full duplex nature of the link: the DMA engine can consist of multiple upstream DMA threads and multiple downstream DMA threads.
However, only one thread in each direction can be active on the link at any one instant.
Conventional Multi-channel DMA Approach
Multiple upstream DMA engines (Up DMA 1..m) and downstream DMA engines (Down DMA 1..n) share a single arbitrated bus (e.g. AHB) into the PCIe core (TL/LL/PHY).
Interconnects like AHB are a poor choice because they don't provide full duplex data transfer.
PCIe Optimized DMA Approach
Give the transaction layer separate upstream and downstream data paths: a MUX merges the upstream DMA channels and a DECODE block steers downstream traffic to its channel, so data can flow in both directions simultaneously.
Scatter/Gather Controller Example
Block diagram overview:
SG engine: sequencer, instruction decoder, and registers (CSR, DPTR, RA, RB, SYS_ADDR_H, SYS_ADDR_L, XFER_CTL, EVENT_SET, EVENT_CLR, EVENT_EN, EVENT)
Descriptor RAM holding the SG program, with host access (address/data) and host access to the SG registers
Upstream and downstream DMA masters fed by per-channel application FIFOs through channel select MUX/DECODE logic
FIFO status and external conditional inputs drive the sequencer's JMP condition select
The EVENT register supports application interaction and drives the interrupt output
Example SG List Entry
XFER_CTL: specifies transfer count, direction, stream ID, etc.
SYS_ADDR_H / SYS_ADDR_L: 64-bit host memory address
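A minimal C rendering of that entry; the bit packing inside XFER_CTL below is purely illustrative, since the slide only names the fields:

    #include <stdint.h>

    /* One SG list entry: three 32-bit words, as named on the slide. */
    typedef struct {
        uint32_t xfer_ctl;    /* transfer count, direction, stream ID, ... */
        uint32_t sys_addr_h;  /* host memory address, upper 32 bits */
        uint32_t sys_addr_l;  /* host memory address, lower 32 bits */
    } sg_entry;

    /* Hypothetical XFER_CTL packing, for illustration only. */
    #define XFER_COUNT(n)   ((n) & 0xFFFFFFu)               /* bytes       */
    #define XFER_UPSTREAM   (1u << 24)                      /* 1 = to host */
    #define XFER_STREAM(id) (((uint32_t)(id) & 0x7u) << 25) /* stream ID   */

    static inline sg_entry sg_make(uint64_t host_addr, uint32_t count,
                                   int upstream, unsigned stream_id)
    {
        sg_entry e;
        e.xfer_ctl   = XFER_COUNT(count)
                     | (upstream ? XFER_UPSTREAM : 0)
                     | XFER_STREAM(stream_id);
        e.sys_addr_h = (uint32_t)(host_addr >> 32);
        e.sys_addr_l = (uint32_t)host_addr;
        return e;
    }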
SG Engine Instruction Set
Load, Store, Add System Address
– Used to manipulate the system address register, which specifies the host source/destination address for a DMA transfer
Load XFER_CTL
– Pushes a command into either the upstream or downstream DMA controller
Load, Store, Add RA/RB registers
– Used to manipulate the indexing/counting registers RA and RB
Conditional jump
– Used for polling FIFO status and for looping
"Event" assertion
– Used as a semaphore mechanism and to signal interrupts
Simplified DMA Main Control Sequence
After initialization, the sequencer loops round-robin over the DMA channels: for each channel 0 through n-1, it checks whether the channel is ready for servicing; if yes, it services the channel, then moves on to the next one. A C sketch of the loop follows.
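A minimal C sketch of that loop, with channel_ready() and service_channel() as hypothetical stand-ins for the sequencer's FIFO-status conditional jump and the descriptor loads that move the data:

    #include <stdbool.h>

    #define NUM_CHANNELS 4  /* illustrative channel count */

    /* Hypothetical helpers; stubs keep the sketch self-contained. */
    static bool channel_ready(int ch)   { (void)ch; return false; }
    static void service_channel(int ch) { (void)ch; }

    void dma_main_sequence(void)
    {
        for (;;) {                                    /* run forever     */
            for (int ch = 0; ch < NUM_CHANNELS; ch++) {
                if (channel_ready(ch))                /* FIFO status OK? */
                    service_channel(ch);              /* run descriptors */
            }
        }
    }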
Root Complex Characteristics
Max payload size is 128B on most systems
– 256B on some server class chipsets and newer desktops; 512B seen in some of the latest systems at Plugfest; spec allows up to 4KB
Max read request size supported
– Typical is 512B; spec allows up to 4KB
Typical read completion packet size
– Most systems use a cache line based fetching mechanism, resulting in 64B cache aligned packets
– Some RC chipsets provide a read combining feature that will opportunistically combine multiple sequential 64B packets
– Spec allows up to 4KB
Virtual channels
– Typical RCs, switches, and FPGA cores support only the default VC0; spec allows for up to 8 hardware VCs
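These limits are configured through the Device Control register of the PCIe capability structure; a minimal decode in C, with the field positions taken from the PCIe base specification:

    #include <stdint.h>

    /* Device Control register: Max_Payload_Size in bits 7:5,
     * Max_Read_Request_Size in bits 14:12; both encode
     * 128 << field bytes (000b = 128B up to 101b = 4KB). */
    static unsigned max_payload_bytes(uint16_t devctl)
    {
        return 128u << ((devctl >> 5) & 0x7);
    }

    static unsigned max_read_request_bytes(uint16_t devctl)
    {
        return 128u << ((devctl >> 12) & 0x7);
    }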
Outstanding Reads vs. Performance
Measured throughput (MB/s, 0 to 800) for endpoint DMA reads against 4 different root complexes over a PCIe 1.x x4 link, with 1 to 4 outstanding read requests. Throughput rises as more reads are kept in flight, with the size of the gain varying by system.
Completion Ordering
(Diagram contrasting FIFO order and out-of-order return of read completions; summarized on the next slide.)
Completion Ordering Summary
FIFO order
– Lowest latency for single stream traffic
– Fewer outstanding requests needed to sustain throughput
Out of order
– Lowest latency for multi stream traffic
– Better when you have multiple streams with small FIFOs: you can use one outstanding request per FIFO and thus avoid re-ordering logic
Actual systems
– Some use FIFO order, some out of order; the endpoint generally needs to support both unless the RC is always known
– Typical PCIe IP cores do not re-order for you; doing so requires additional logic
Link Efficiency
The specified link rate of 2.5 GT/s (PCIe 1.x), 5 GT/s (PCIe 2.x), or 8 GT/s (PCIe 3.0) is not all usable.
Example:
– An x1 PCIe 1.x link carries 312.5 MB/s of raw bandwidth
– Subtract 8b/10b encoding = 250 MB/s
– Subtract link layer traffic (Ack/Nak, replay, flow control updates, etc.)
– Subtract packet overhead
Packet overhead, per TLP on the wire:
– STP framing (PHY/PCS): 1 byte
– Sequence number (DLL): 2 bytes
– TLP header: 12 bytes (32-bit requests and completions) or 16 bytes (64-bit requests)
– Data payload: between 4 bytes and MAX_PAYLOAD size
– LCRC (DLL): 4 bytes
– END framing (PHY/PCS): 1 byte
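Putting those numbers together, packet level efficiency is payload / (payload + overhead). A small C sketch that reproduces the curves on the next slide:

    #include <stdio.h>

    /* Overhead per TLP = STP(1) + seq(2) + header(12 or 16)
     * + LCRC(4) + END(1) bytes. */
    static double link_efficiency(unsigned payload, unsigned header)
    {
        unsigned overhead = 1 + 2 + header + 4 + 1;
        return (double)payload / (payload + overhead);
    }

    int main(void)
    {
        for (unsigned p = 4; p <= 4096; p *= 2)
            printf("%4u B payload: %5.1f%% (12B hdr)  %5.1f%% (16B hdr)\n",
                   p, 100.0 * link_efficiency(p, 12),
                      100.0 * link_efficiency(p, 16));
        return 0;  /* e.g. 128B payload with a 12B header -> ~86.5% */
    }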
Link Efficiency vs. Payload Size
Link efficiency (0% to 100%) vs. payload size (4 bytes to 4 KB), plotted for 12-byte and 16-byte headers. Efficiency rises steeply with payload size, approaching 100% at the largest payloads.
RC/Endpoint Performance Under System Load
Experiment: an x4 endpoint doing 600-700 MB/s of DMA in each direction to host memory via the RC, with the system at ~2% CPU load (including the DMA driver). What happens to endpoint throughput when the host is stressed?
– Scenarios: 100% CPU load, memory stress test, high IO traffic, GPU-to-host-memory traffic
– Test results show a worst case degradation of only ~6% on a variety of PC motherboards
Conclusion
– Typical PC RC memory contention is minimal
– Rule of thumb for sustainable throughput: 150 MB/s per PCIe lane per direction (double for PCIe 2.x)
Avoiding Hazards and Race Conditions
The definition of endpoint control/status registers needs to be multi-core/multi-thread friendly.
Think like a driver/OS programmer (or at least have one review your spec).
Avoid registers that cause a state change on a read
– Can be a problem for bridges/processors that do caching/prefetching
Where reads do cause state changes, avoid packing multiple bit fields into the same naturally aligned DW (32 bits)
– Example: 8-bit read FIFOs from 2 different UARTs packed into the same DW (see the sketch below)
– Why? Byte lane selection is not available for block operations and prefetching, and some processors don't provide byte lane information for reads
Use IOV-like constructs, such as providing multiple views of the register space to different processors/processes.
Impact on performance: poor control mechanisms can result in ugly SW workarounds that can seriously impact performance.
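A minimal C sketch of the UART example, with hypothetical offsets, showing why the packing is hazardous and one safe alternative:

    #include <stdint.h>

    /* BAD: two read-to-pop UART RX FIFOs packed into one DW. Any 32-bit
     * read (or a bridge/CPU prefetch) pops BOTH FIFOs at once, because
     * byte lane information is not available for such reads. */
    struct bad_regs {
        uint32_t uart_rx_both;  /* [7:0] UART0 RX data, [15:8] UART1 RX data */
    };

    /* BETTER: one naturally aligned DW per side-effecting register, so
     * a read touches exactly one FIFO regardless of access width. */
    struct good_regs {
        uint32_t uart0_rx;  /* offset 0x0: read pops only UART0's FIFO */
        uint32_t uart1_rx;  /* offset 0x4: read pops only UART1's FIFO */
    };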
Interrupt Controller Design
Problem: assembling a design from IP blocks can result in a poorly thought out interrupt controller that SW engineers will hate you for
– Example: lack of a single register to determine the source of a shared interrupt
– Common practice is to "OR" together all on-chip interrupt sources and use that to generate INTx or MSI/MSI-X
Best practices (a minimal ISR sketch follows this list):
– A single read-only status register where the status bits of all IP blocks are always readable (even if interrupt forwarding/generation is disabled)
• No need to poll multiple registers to determine the source of an interrupt
• You always have the option of polling a single register rather than employing interrupts
– When multiple interrupts (hard or messaged) are to be generated, have an enable (or mask) register per interrupt output
– Make sure it is possible to clear each interrupt source separately: clearing one interrupt source must not cause another to be cleared unintentionally
– Make sure you support INTx (legacy interrupt mode) in addition to MSI or MSI-X; Windows XP and earlier do not support MSI/MSI-X
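A minimal C sketch of the ISR that a single always-readable status register enables, assuming hypothetical INT_STAT (read-only) and INT_CLR (write-1-to-clear) registers:

    #include <stdint.h>

    #define INT_STAT 0x00  /* RO: status bits of all sources, always live */
    #define INT_CLR  0x04  /* W1C: write 1 to clear exactly one source    */

    static volatile uint32_t *regs;  /* mapped BAR, setup not shown */

    void shared_isr(void)
    {
        uint32_t pending = regs[INT_STAT / 4];  /* one read finds all */

        for (int bit = 0; bit < 32; bit++) {
            if (pending & (1u << bit)) {
                regs[INT_CLR / 4] = 1u << bit;  /* clears only this one */
                /* ... dispatch the handler for source 'bit' ... */
            }
        }
    }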
Example Interrupt Controller
Block diagram: on-chip interrupt sources, plus a GPIO controller (programmable I/O pins 7:0 with GPIO control and output registers), feed an interrupt controller that drives outputs INT0 through INT7 and MSI generation. There is one INT_CFG interrupt configuration register per interrupt output and a single INT_STAT interrupt status register.
Tips for Scalable Endpoint Design
Assume that packet latency and context switching latency will increase in future systems
– Don't rely on fast interrupt handling to keep your data pipes filled
– Avoid interrupts altogether if possible; use low frequency polling
– Rely on DMA with large scatter/gather lists that don't have to be updated very often
Assume that throughput AND latency will increase in future systems
– Have the host driver poll on host memory based semaphores rather than polling the IO subsystem: have the IO subsystem write semaphores into host memory (see the sketch below)
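A minimal C sketch of that pattern, assuming a hypothetical completion counter that the endpoint DMA-writes into host memory after each buffer it finishes; the driver then polls fast local DRAM instead of issuing slow reads across the link:

    #include <stdint.h>

    /* Completion area living in host memory; the endpoint DMA-writes
     * 'sequence' as it completes work. volatile: updated by the device. */
    struct completion_area {
        volatile uint32_t sequence;  /* monotonically increasing count */
    };

    /* Poll local DRAM (a few ns per read) instead of the endpoint's BAR
     * (hundreds of ns to µs per read, per the earlier measurements). */
    uint32_t wait_for_progress(const struct completion_area *ca,
                               uint32_t last_seen)
    {
        while (ca->sequence == last_seen)
            ;  /* or sleep/reschedule between low frequency polls */
        return ca->sequence;
    }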
Summary
SW/HW interaction should be designed to be relatively insensitive to packet level latency
– High sensitivity to packet level latency may be a sign of poor HW/SW interaction
Assume that interrupt latency will be wildly variable
– For isochronous data (example: video), rely on large SG lists so that the endpoint can operate for a long period without interrupt servicing
Take advantage of the bidirectional nature of the PCIe link
– Avoid internal busses that are not bidirectional, or have separate upstream/downstream buses to feed the transaction layer
Thank you for attending the PCI-SIG Developers Conference 2009
For more information please go to www.pcisig.com