Date post: | 17-Dec-2015 |
Implementing PCI I/O Virtualization Standards
Mike Krause and Renato Recio
PCI SIG IOV Work Group Co-chairs
Today’s Approach To IOV
Virtualization Intermediaries (VI) and hypervisors are used to safely share IO
Under this approach
1 or more System Images share the PCI device through a VI
Virtualization enablers are not needed in either the Root Complex (RC) or the PCIe Device
The VI is involved in all IO transactions and performs all IO Virtualization Functions
VI Based PCI Device Sharing Example
System Images share the adapter through a VI. MMIO and DMA operations go through the VI
Today’s PCIe Device with one or more Functions. The Device may not be cognizant at all that it is being shared
[Figure labels: SI, SI, VI, Hypervisor, PCIe RC, PCIe Port, PCIe Device, F]
Performance Of Today’s IOV
VI based IOV adds path length on every IO operation
Native IOV significantly improves performance
For the example above, Native IOV doubles throughput and reduces latency by up to half
Several factors are increasing the use of virtualization (e.g., more cores per socket, customer simplification requirements, …)
This makes Native IOV even more important in the future
[Charts: Throughput* and Latency*, each comparing Native IOV with VI Based IOV]
*Source: Self-Virtualized I/O: High Performance, Scalable I/O Virtualization in Multi-core Systems; R. Himanshu, I. Ganev, K. Schwan - Georgia Tech and J. Xenidis - IBM
PCI SIG IOV Overview
PCI SIG is standardizing mechanisms that enable PCIe Devices to be directly shared, with no run-time overheads
Single-Root IOV – Direct sharing between SIs on a single system
Multi-Root IOV – Direct sharing between SIs on multiple systems
PCI-SIG IOV Specification covers “north-side” of the Device
“PCI IOV Usage Models and Implementations” session will cover examples of how PCI IOV specifications can be used to virtualize PCIe Devices
[Figure: two panels. PCIe Single-Root IOV: SIs and a VI on a Hypervisor, PCIe RC, PCIe Port, PCIe IOV Capable Device. PCIe Multi-Root IOV: Blades, each with SIs, a VI, a Hypervisor and a PCIe RC, attached through a PCIe Topology to a PCIe MR-IOV Capable FC Device (FC SAN) and a PCIe MR-IOV Capable Enet Device (Ethernet LAN)]
Terminology
System Image (SI)
SW, e.g., a guest OS, to which virtual and physical devices can be assigned
Virtual Intermediary (VI)
Performs resource allocation, isolation, management and event handling
PCIM – PCI Manager
Controls configuration, management and error handling of PFs and VFs
May be in SW and/or Firmware
May be integrated into a VI
Translation Agent (TA)
Uses ATPT to translate PCI Bus Addresses into platform addresses
Address Translation and Protection Table (ATPT)
Validates access rights of incoming PCI memory transactions
Translates PCI Address into platform physical addresses
[Figure labels: Processor, Memory, SI, SI, VI, SR-PCIM, Hypervisor, PCIe Root Complex with TA and ATPT, PCIe Ports, PCIe Switch, two PCIe Devices each with a Function F]
Terminology… Continued
Single-Root IOV (SR-IOV) - A PCIe hierarchy with one or more components that support the SR-IOV Capability
Multi-Root Aware (MRA) - A PCIe component that supports the MR-IOV capability
Multi-Root IOV (MR-IOV) - A PCIe Topology containing one or more PCIe Virtual Hierarchies
Virtual Hierarchy (VH) - A portion of an MR Topology assigned to a PCIe hierarchy, where each VH has its own PCI Memory, IO, and Configuration Space
[Figure: PCIe Single-Root IOV and PCIe Multi-Root IOV panels. Blades with SIs, VI, Hypervisor and PCIe RC connect through a PCIe Fabric to a PCIe MRA Device, a PCIe Device and a PCIe SR-IOV MRA Device; Virtual Hierarchies are labelled VH0, VH1 … VHN]
Terminology… Continued
Address Translation Cache (ATC)
Cache of recent translations
Physical Function (PF)
Function that supports SR-IOV
Used to manage VFs
Virtual Function (VF)
Function that supports SR-IOV and shares resources with the PF it is associated with
Base Function (BF)
Function that supports MR-IOV
Used to manage PFs and VFs on MRA Devices
[Figure: two devices. PCIe SR-IOV Capable Device: PCIe Port, Internal Routing, PF0 with VF1 … VFN, each VF with its own ATC and Resources. PCIe MR-IOV Capable Device: PCIe Port, Internal Routing, BF0 and repeated groups of PF0 with VF1 … VFN]
PCI IOV Related Mechanisms
Function Level Reset (FLR)
Alternative Routing ID Interpretation (ARI)
Address Translation Services
Single-Root IO Virtualization
Multi-Root IO Virtualization
FLR And ARI
FLR - Provides Function level granularity on resets
All software readable state must be cleared by an FLR
All outstanding transactions associated with the Function referenced by the FLR must be completed when the FLR is returned as completed
ARI - Extends Function number field from 3 to 8 bits
Allows up to 256 Functions or VFs per PCIe Device
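
The effect of ARI on the Routing ID can be sketched as follows. This is an illustrative helper, not spec text: the name decode_rid and the dictionary layout are assumptions; the field widths (8-bit Bus, 5-bit Device, 3-bit Function without ARI, and 8-bit Bus, 8-bit Function with ARI) follow the base PCIe RID layout and the "3 to 8 bits" statement above.

```python
def decode_rid(rid: int, ari: bool) -> dict:
    """Split a 16-bit Routing ID into its fields (hypothetical helper)."""
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("RID must fit in 16 bits")
    bus = rid >> 8
    if ari:
        # With ARI the whole low byte is the Function number (0..255).
        return {"bus": bus, "function": rid & 0xFF}
    # Without ARI: 5-bit Device number followed by a 3-bit Function number.
    return {"bus": bus, "device": (rid >> 3) & 0x1F, "function": rid & 0x7}
```

The same RID value therefore names a different Function depending on whether ARI is in effect, which is why the component above the Device has to agree on the interpretation.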
[Figure: VI and SIs on a Hypervisor, PCIe Root Complex with ATPT, PCIe Topology, and two devices. A PCIe Device with Routing, Config Mgt and Function1 to Function3 receiving FLRs; a PCIe Device (FC HBA) with PCIe Port, Routing, PF0, VF1 to VF3 and two FC Ports attached to FC SANs]
Address Translation Services
ATS is used to cache PCIe Memory Address translations in a PCI Device
Consists of three new PCIe transactions
Request Translation Transaction – Used by a PCIe Device to request a translation of an untranslated address
Translated DMA Transaction – Used to perform a DMA that references a translated address
Invalidate Translation Transaction – Used to invalidate a previously exposed translated address
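
A toy model of how the three transactions interact with a device-side cache is sketched below. It is only an illustration under stated assumptions: the class and method names are hypothetical, addresses are reduced to page numbers, and the Translation Agent is a plain dictionary lookup rather than real ATPT hardware.

```python
class TranslationAgent:
    """Stand-in for the TA/ATPT in the Root Complex (hypothetical model)."""

    def __init__(self, table):
        self.table = dict(table)  # untranslated page -> translated page

    def translate(self, page):
        return self.table[page]  # a KeyError models a refused translation


class AddressTranslationCache:
    """Device-side ATC holding recent translations (hypothetical model)."""

    def __init__(self, agent):
        self.agent = agent
        self.entries = {}

    def request_translation(self, page):
        # Request Translation: only a cache miss goes to the TA.
        if page not in self.entries:
            self.entries[page] = self.agent.translate(page)
        return self.entries[page]

    def translated_dma_target(self, page):
        # Translated DMA: uses an address the device already holds.
        return self.entries[page]

    def invalidate(self, page):
        # Invalidate Translation: drop a previously exposed translation.
        self.entries.pop(page, None)
```

The point of the invalidate step is visible in the model: until it runs, the device keeps using the cached translation even if the table behind the TA has changed.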
[Figure: PCIe Device with Address Translation Services: PCIe Port, Routing, PF0, VF1 to VF3, each VF with its own ATC (ATC1 to ATC3). Above it a Downstream Port, PCIe Topology, PCIe Root Complex with ATPT, Hypervisor, SIs and VI]
PCI Single-Root IOV
Overview
PCIe SR-IOV Capable Device
Today’s PCI Device
Function 0 is required
Overview of Function Attributes
Each Function has its own configuration and PCIe memory address space
Up to 8 PCI Functions with unique configuration space / BAR / etc.
ARI Capability enables up to 256 Functions to be supported
Support INTx, MSI, MSI-X or combination of MSI and MSI-X
Function dependencies through vendor specific mechanisms
Cannot be directly shared by SIs
Vendor specific mechanisms to associate Functions to “South-side” resources
[Figure: Downstream Port, PCIe Port, Internal Routing, Functions F0, F1, F2 … FN, each with its own ATC and Resources]
SR-IOV Device Overview
Function 0 is required
Overview of VF attributes
Each VF has its own configuration and PCIe memory address space
VFs share a contiguous PCIe memory address space
Up to ~2^16 Virtual Functions
ARI enables up to 256
IOV enables additional Bus Numbers to be associated
Function dependencies defined through standard mechanism
Support MSI and MSI-X
Can be directly shared by SIs
Vendor specific mechanisms to associate Functions to “South-side” resources
[Figure: PCIe SR-IOV Capable Device: Downstream Port, PCIe Port, Internal Routing, PF0 and VF1, VF2 … VFN, each VF with its own ATC and Resources]
SR-IOV Device Discovery
Each PF must have an SR-IOV Extended Capability structure
A Device may have a mix of Functions and PFs
Each Function and PF consumes one Routing Identifier (RID)
For a Device that
isn’t ARI capable, Functions and PFs must be in the first 8 functions;
is ARI capable, number of Functions and PFs supported is as defined in the base specification

Offset  Bits 31:20               Bits 19:16  Bits 15:0
00h     Next Capability Pointer  Cap. Vrsn.  Capability ID
04h     SR IOV Capabilities
:       :
[Figure: PCI Root, SI A, SI B and SR PCIM above a PCIe Topology containing PF0, PF1, F2 and PF3]
VF Discovery - Part 1
The TotalVFs field is used to discover the number of Active VFs that a PF could have.
The InitialVFs field is used to discover the number of Active VFs that a PF initially has.
Note for a Device that isn’t MR capable*: 1 ≤ InitialVFs = TotalVFs

Offset  Bits 31:16     Bits 15:0
:       :
0Ch     TotalVFs (RO)  InitialVFs (RO)
:       :
BAR Discovery
Each PF has its own independent set of BARs in its standard configuration space and an MSE bit for those BARs
The VFs share a BAR set and have an MSE bit that controls the memory space of all the VFs
The BAR set that is shared by all the VFs resides in the PF’s SR-IOV capabilities

Offset  Bits 31:0
:       :
20h     VF BAR0 (RW)
:       :
34h     VF BAR5 (RW)
:       :
Supported Page Sizes
Supported Page Sizes is used to discover the page sizes supported by the VFs associated with the PF
When this field is read, the PF must return the page sizes it can support
This field will be used during the IOV configuration phase to align VF BAR apertures on system page boundaries (more later)

Offset  Bits 31:0
:       :
18h     Supported Page Sizes (RO)
:       :
Configuring VFs
VFs for a PF are enabled by writing NumVFs and then setting the “VF Enable” bit
When VFs are enabled, the PCIe Device must associate NumVFs worth of VFs with the PF
If VF Migration Capable and VF Migration Enable are both set, then NumVFs must be
1 ≤ NumVFs ≤ TotalVFs
Otherwise NumVFs must be
1 ≤ NumVFs ≤ InitialVFs
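
The NumVFs range rule can be written as a small check that configuration software might run before setting VF Enable. This is a sketch only: the function names are hypothetical and the register fields are passed in as plain integers and booleans rather than read from configuration space.

```python
def num_vfs_limit(initial_vfs, total_vfs, migration_capable, migration_enabled):
    """Upper bound for NumVFs according to the rule on this slide."""
    if migration_capable and migration_enabled:
        return total_vfs
    return initial_vfs


def check_num_vfs(num_vfs, initial_vfs, total_vfs,
                  migration_capable=False, migration_enabled=False):
    """Return num_vfs if it is in range, otherwise raise ValueError."""
    limit = num_vfs_limit(initial_vfs, total_vfs,
                          migration_capable, migration_enabled)
    if not 1 <= num_vfs <= limit:
        raise ValueError(f"NumVFs must be between 1 and {limit}")
    return num_vfs
```

Keeping the limit calculation separate from the range check makes it easy to report the allowed maximum to the caller.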

Offset  Bits 31:16              Bits 15:0
:       :
04h     SR IOV Capability (RO)
08h     SR IOV Status           SR IOV Control (RW)
10h     Reserved                NumVFs (RW)
:       :
Bits called out on the slide: VF Enable, VF Migration Enable, VF Migration Capable
Configuring The VFs’ BARs
System Page Size defines the page size the system will use
Value of System Page Size must be one of the Supported Page Sizes
System Page Size is used by the PF to align the MMIO aperture defined by each BAR to a system page boundary
VF BARs behave as in PCI 3.0 Spec’s PCI BARs, except that a VF BAR describes the aperture for multiple VFs (see next page)

Offset  Bits 31:0
:       :
1Ch     System Page Size (RW)
20h     VF BAR0 (RW)
:       :
34h     VF BAR5 (RW)
:       :
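
One way to read the alignment rule is that each VF's aperture is rounded up to a whole number of system pages, so that consecutive VF apertures start on page boundaries. The sketch below assumes exactly that; the function name is hypothetical, and page sizes are given directly in bytes rather than in the register's own encoding, which the slide does not show.

```python
def align_vf_aperture(aperture_bytes, system_page_bytes, supported_page_bytes):
    """Round a per-VF aperture up to a multiple of the system page size."""
    if system_page_bytes not in supported_page_bytes:
        # System Page Size must be one of the Supported Page Sizes.
        raise ValueError("system page size is not supported by the PF")
    pages = -(-aperture_bytes // system_page_bytes)  # ceiling division
    return pages * system_page_bytes
```

For example, a 6 KB aperture with a 4 KB system page would occupy 8 KB per VF under this reading.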
PF And VF BAR Semantics
The memory aperture required for each VF BAR can be determined by writing all “1”s and then reading the VF BAR
Behaves as in the base PCI Spec
The address written into each VF BAR is used by the Device to set the starting address for that BAR on the first VF
The differences between VF BARs and PCI 3.0 Spec BARs are
For each VF BAR, the memory space associated with the 2nd and higher VFs is derived from the starting address of the first VF and the memory space aperture
The VF BAR’s MMIO space is not enabled until VF Enable and VF MSE have been set
[Figure: PF Config Space (BAR0) points to the PF BAR0 MMIO Aperture; PF IOV Config Space (VF BAR0) points to consecutive VF 1, VF 2 … VF N BAR0 MMIO Apertures in PCIe MMIO Space]
BAR1 VF N SA = BAR1 VF 1 SA + (N - 1) x (BAR1 VF MA)
where BAR1 VF 1 SA is the address written into VF BAR1 and MA is the memory aperture of VF BAR1.
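
As a worked example of the address derivation, the sketch below takes the address written into a VF BAR as VF 1's starting address and places each later VF one aperture further on, which is how this slide describes the first VF and the "2nd and higher VFs". The function name and the sample values are illustrative assumptions.

```python
def vf_bar_start(first_vf_start, aperture, n):
    """Starting address of a VF BAR for VF n, counting VFs from 1."""
    if n < 1:
        raise ValueError("VF numbers start at 1")
    return first_vf_start + (n - 1) * aperture


# Hypothetical values: VF BAR programmed to 0xF0000000, 16 KB per VF.
starts = [hex(vf_bar_start(0xF0000000, 0x4000, n)) for n in (1, 2, 3)]
```

With these values, VF 1 starts at the programmed address and VF 3 starts two apertures above it.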
VF Discovery – Part 2
The values in NumVFs and VF ARI Enable affect the values for First VF Offset and VF Stride*
First VF Offset and VF Stride are defined by the Device
Following is the equation for determining the RID of VF N:
VF N RID = [PF RID + First VF Offset + (N-1) * VF Stride] Modulo 2^16
where all arithmetic used is 16 bit unsigned dropping all carries

Offset  Bits 31:16      Bits 15:0
:       :
14h     VF Stride (RO)  First VF Offset (RO)
:       :
*Note: After setting NumVFs and VF ARI Enable, SR-PCIM can read these two fields to determine how many busses will be consumed by the PF’s VFs
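
The RID equation and the bus-count note can be sketched together. The function names are hypothetical; the arithmetic is the 16-bit unsigned, carry-dropping form stated above, and the bus number is taken as the upper 8 bits of the RID.

```python
def vf_rid(pf_rid, first_vf_offset, vf_stride, n):
    """RID of VF n (n counts from 1), using 16-bit wrap-around arithmetic."""
    return (pf_rid + first_vf_offset + (n - 1) * vf_stride) & 0xFFFF


def vf_buses(pf_rid, first_vf_offset, vf_stride, num_vfs):
    """Set of bus numbers on which the PF's VFs land."""
    return {vf_rid(pf_rid, first_vf_offset, vf_stride, n) >> 8
            for n in range(1, num_vfs + 1)}
```

For instance, a PF at RID 0200 (hex) with First VF Offset 2 and VF Stride 4 places VF 1 at 0202 and VF 2 at 0206; with 80 VFs the set returned by vf_buses covers two bus numbers.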
PF And VF BAR Semantics
PF and VF RIDs must not overlap given any valid NumVFs setting across all PFs of a Device
As in base, an SR-IOV Device captures Bus # from any Type 0 config request
VFs may reside on a different bus number than the associated PF
If the switch above the Device
supports the ARI Forwarding bit, RIDs are interpreted as 8 bit Bus # and 8 bit Function #
doesn’t support ARI, RIDs are interpreted as BDF#
Bus # can be used to assign VFs, but PFs must be on 1st Bus assigned to Device
PCIe RID Number

RID #   Func.
:       :
0200    PF 0
0201    PF 1
0202    VF 0,1
0203    VF 1,1
0204
0205
0206    VF 0,2
0207    VF 1,2
0208
0209
020A    VF 0,3
020B    VF 1,3
020C
020D
020E    VF 0,4
:       :
0310
0311
0312
VF 0,40
:       :

PF 0 IOV Config Space (PF 0 RID = 0200)
:
Reserved
NumVFs = 0050x
VF Stride = 0004x
VF Offset = 0002x
:

PF 1 IOV Config Space (PF 1 RID = 0201)
:
Reserved
NumVFs = 0003x
VF Stride = 0004x
VF Offset = 0002x
:
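
The non-overlap requirement can be checked mechanically for a given configuration. The sketch below uses the example values from this slide, reading the trailing "x" as a hexadecimal marker (an assumption); the function names and the tuple layout are hypothetical.

```python
def device_rids(pfs):
    """List every RID used by a Device.

    pfs is a list of (pf_rid, vf_offset, vf_stride, num_vfs) tuples.
    """
    rids = []
    for pf_rid, vf_offset, vf_stride, num_vfs in pfs:
        rids.append(pf_rid)
        for n in range(1, num_vfs + 1):
            rids.append((pf_rid + vf_offset + (n - 1) * vf_stride) & 0xFFFF)
    return rids


def rids_overlap(pfs):
    """True if any two Functions of the Device would share a RID."""
    rids = device_rids(pfs)
    return len(rids) != len(set(rids))


# Example from the slide: PF 0 at 0200 with 0x50 VFs, PF 1 at 0201 with 3 VFs.
example = [(0x0200, 2, 4, 0x50), (0x0201, 2, 4, 3)]
```

With an offset of 2 and a stride of 4, the VFs of PF 0 and PF 1 interleave on alternating RIDs and never collide; a stride of 1 with the same offsets would make them collide.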
Reset Mechanisms
Three reset mechanisms are supported
Conventional Reset – Resets all PF and VF state
FLR that targets a VF – Resets a single VF
FLR that targets a PF – Resets a PF and its associated VFs
Conventional Reset
A Fundamental or Hot Reset to an SR-IOV Device shall cause all Function, PF, and VF context to be reset to its original, power-on state
If a PF has its VFs enabled and a Fundamental or Hot Reset is issued to the Device, the Device must reset all PF and VF state, e.g.:
The PF must disable its SR-IOV capabilities and revert to being a PCI Function
Settable SR-IOV capabilities (e.g., NumVFs) are reset to default values
MSE and BME are both off
All BAR values used by the PF and VFs are indeterminate
All interrupts are disabled
FLRs To VFs And PFs
An FLR that targets a VF must be supported
Software may use FLR to reset a VF
An FLR that targets a VF must
Not affect the VF’s existence (e.g. it still consumes a RID)
Not affect any address assigned to it. That is, the VF’s BAR registers and MSE are unaffected by FLR
An FLR that targets a PF must be supported
Software may use an FLR to reset a PF
An FLR that targets a PF must
Reset the PF’s SR-IOV Extended Capabilities
The VFs are no longer allocated or enabled
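
The difference between the two FLR targets can be illustrated with a small state model. Everything here is a simplification for illustration: the class and field names are hypothetical, a single integer stands in for a VF's software-readable state, and a default of 0 for NumVFs is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualFunction:
    rid: int
    scratch: int = 0  # stands in for the VF's software-readable state


@dataclass
class PhysicalFunction:
    num_vfs: int = 0
    vf_enable: bool = False
    vf_mse: bool = False
    vf_bar0: int = 0
    vfs: list = field(default_factory=list)

    def flr_vf(self, index):
        # Only the targeted VF's own state is cleared. It keeps its RID,
        # and the shared VF BAR and VF MSE held by the PF are untouched.
        self.vfs[index].scratch = 0

    def flr_pf(self):
        # The SR-IOV capability is reset, so the VFs are no longer
        # allocated or enabled. BAR contents are not modelled here.
        self.num_vfs = 0
        self.vf_enable = False
        self.vf_mse = False
        self.vfs.clear()
```

A VF-targeted FLR leaves the sibling VFs and the address assignment alone, while a PF-targeted FLR removes every VF at once.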
PCI Multi-Root IOV
Overview
Today’s PCIe Topology
Strict Tree
Root Ports Connect to Switches
Switches Connect to Devices
Root Ports can also connect to Devices
Switches can also connect to more Switches
Single Software Management Entity
Runs above the Root
Implicit Message Routing
Route to Root
Broadcast from Root
[Figure: Single Switch MR Topology. Root A, Root B, Root C and Root D (with SIs) connect through one MR Switch to three MR Devices and a PCIe Device, attached to FC SAN and Ethernet LAN]
[Figure: Two Switch MR Topology. Root A, Root B, Root C and Root D connect through a Left and a Right MR Switch to three MR Devices and a PCIe Device, attached to FC SAN and Ethernet LAN]
TLP Labeling
TLP Prefix on MR Links
After Seq #, before TLP Hdr
Sent / Resent with TLP
Prefix Contains
VH Number (link local)
VL Number (flow control)
Global Key (error checking)
Added at MR Ingress
First MR Component seen
Dropped at MR Egress
Last MR Component seen
[Figure: TLP layout on an MR Link: STP, Sequence #, TLP Prefix, PCIe TLP Header, TLP Data (optional), ECRC (optional), LCRC, END]
SR-IOV Device Overview
[Figure: MR-IOV Capable Device: Downstream Port, PCIe Port, Internal Routing, PF0 with PF0 VF1 … PF0 VFN]
MR-IOV Device Overview
Each Virtual Hierarchy (VH) has a full PCIe address space
Configuration, Memory and I/O Space
Up to 2^8 Virtual Hierarchies
VH0 is used to Manage MR-IOV features of the Device
BF in VH0 manages associated PFs and VFs
Other Functions optional in VH0
VH1 to VHmax have same PFs at the same Function #s
Each PF in each VH has its own values for InitialVFs, TotalVFs, NumVFs, VF Stride and VF Offset
[Figure: MR-IOV Capable Device: Downstream Port, PCIe Port, Internal Routing, VH0 BF0, VH1 PF0 with VH1 PF0 VF1 … VFN, through VHK PF0 with VHK PF0 VF1 … VFM]
MR-IOV Device Configuration
Initially, only VH0 is enabled
Within VH0, MR-PCIM enumerates Config Space
Locate all BFs by looking for MR-IOV Capability
For each Device located…
In BF0
Determine MaxVH, set NumVH
Enable additional VLs (if available)
Configure VL arbitration (optional)
Configure per VH mapping of VCID to VL
Configure per VH Global Keys
In every BF
In each VH, provision VC Resources (optional)
If hardware present, configure Initial VF mapping
PCI Specification Schedule
ATS
SR-IOV
MR-IOV
Schedules
ATS Specification released 3/8/2007
SR-IOV Specification
Following PCI SIG Specification Process
Draft 0.7 Completed
Draft 0.9 May, 2007
Version 1.0 early 3Q/2007
MR-IOV Specification
Following PCI SIG Specification Process
Draft 0.5 Completed
Draft 0.7 2Q/2007
Draft 0.9 late 2Q/2007
Version 1.0 late 3Q/2007
Call To Action
Please participate in the PCI-SIG Specification Development Process
For more information please go to www.pcisig.com
Thank you for attending
Workgroup Members
AMD
Broadcom
Emulex
HP
IBM
IDT
Intel
LSI Logic
Microsoft
NextIO
Neterion
Nvidia
PLX
Qlogic
Stargen
Sun
VMWare
Questions
MR Topics
Back-ups
Switch TLP Routing
Three step routing decision:
MR Input Map
{ Input Port, Input VH# } → { VS, VS Input Port }
MR-PCIM Manages
VS PCIe Map
{ VS Input Port, TLP Hdr } → { VS Output Port }
VH Software Manages using PCIe rules
MR Output Map
{ VS, VS Output Port } → { Output Port, Output VH# }
MR-PCIM Manages
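
The three maps compose into a single lookup chain, which can be sketched with dictionaries. This is an illustration only: the function name, the map layout, and the reduction of the TLP header to a single destination key are all assumptions, and real PCIe routing (by address, ID or implicit rules) is not modelled.

```python
def route_tlp(input_map, vs_pcie_map, output_map, in_port, in_vh, tlp_dest):
    """Apply the three maps in order; returns (output port, output VH#)."""
    # Step 1, MR Input Map: {Input Port, Input VH#} -> {VS, VS Input Port}
    vs, vs_in_port = input_map[(in_port, in_vh)]
    # Step 2, VS PCIe Map: {VS Input Port, TLP Hdr} -> {VS Output Port}
    vs_out_port = vs_pcie_map[vs][(vs_in_port, tlp_dest)]
    # Step 3, MR Output Map: {VS, VS Output Port} -> {Output Port, Output VH#}
    return output_map[(vs, vs_out_port)]
```

Only the middle map is under the control of software inside the VH, so that software sees an ordinary PCIe switch while the outer two maps stay with MR-PCIM.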
Hot Reset
PCIe Hot Reset is broadcast downstream
Switch Upstream port → all Downstream ports
Uses Phy Training Sequence
Handshake to detect link partner received
MR Reset is per VH
VS Upstream port → all VS Downstream ports
Uses Reset DLLP
Handshake to detect link partner received
Link stays up, other VHs unaffected
Flow Control
PCIe uses Virtual Channels (VC) for QoS, traffic isolation, etc.
MR Extends this to Virtual Links (VL)
MR adds per VH, VL flow control
Traffic Gate
PCIe: sufficient VC Credits to send TLP
MR: sufficient VL Credits to send TLP and sufficient VH, VL Credits to send TLP
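
The two traffic gates differ only in how many credit pools must be satisfied. The sketch below treats credits as plain integers, which glosses over the separate header and data credit types of real PCIe flow control; the function names are hypothetical.

```python
def pcie_gate(vc_credits, needed):
    """PCIe: a TLP may be sent if the VC has enough credits."""
    return vc_credits >= needed


def mr_gate(vl_credits, vh_vl_credits, needed):
    """MR: both the VL pool and the per-(VH, VL) pool must have enough."""
    return vl_credits >= needed and vh_vl_credits >= needed
```

The second condition is what keeps one Virtual Hierarchy from consuming all the credits of a shared Virtual Link.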
Management / Authorization
Devices are managed using VH0
Switches are managed using the Upstream port of any Authorized VS
Multiple VS may be simultaneously authorized
Supports MR-PCIM failover
At failover, new MR-PCIM remaps VH0 so it can manage MR Devices
Virtual Hot Plug
Virtual Hot Plug is implemented in each Downstream Port of each VS
Software in VH sees PCIe Slot Control Regs
MR-PCIM controls virtual “buttons & lights”
e.g. pushing the virtual Attention Button, detecting virtual slot power state, …
Physical Hot-Plug is implemented in each Physical Port
Managed by MR-PCIM
Optional (same as PCIe)
Power Management
Each PF has a power state
If all PFs are in a low power state, the link can go to a low power state (like PCIe multi-function rules)
PM_PME Messages sometimes trigger WAKE#
e.g., some Root powered itself off but a shared device below it was only virtually powered off (Device still being used by other VHs)
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this
presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.