I/O Virtualization and System Acceleration in POWER8
Michael Gschwind
IBM Power Systems
Industry Trends Generate New Opportunities
[Figure: Price/Performance on a log scale (1 to 100) over 2004 to 2014. Layers: Semiconductor Technology; Processors; Firmware, Operating System and Hypervisor; Applications and Services (System Stack). Further labels: Systems Management & Cloud Deployment; Systems Acceleration & HW/SW Optimization; Workload Acceleration; Services Delivery Model; Advanced Memory Tech.; Network & I/O Accel.]
System stack innovations are required to
continue Cost/Performance improvements
© 2015 International Business Machines Corporation
System Design ca. 2015
• Consolidation in the Cloud
– Virtual Machine Aggregation
– Processor Virtualization
• I/O Aggregation
– I/O Virtualization
– I/O Isolation
• Maintain Performance growth for critical workloads
– Under Power constraints
– Under Cost constraints
– Under Area constraints
Innovation Through Open Standards
• Consolidation in the Cloud
– Virtual Machine Aggregation
– Processor Virtualization
• I/O Aggregation
– I/O Virtualization
– I/O Isolation
• Maintain Performance growth for critical workloads
– Under Power constraints
– Under Cost constraints
– Under Area constraints
Power Instruction Set Architecture
I/O Design Architecture
Coherent Accelerator Interface Architecture
I/O Design Architecture
• I/O Virtualization
– I/O address translation
– Single-level translation supports contiguous PowerVM partitions
– Hierarchical translation supports Linux/KVM partition memory
• Partition data isolation
– Separate I/O address spaces for partitions
• Partition fault isolation
– Fault management domains based on Partitionable Endpoints (PE)
Partitioning and Managing the I/O Space: Partitionable Endpoints
[Figure: diagram labels: PCIe Host Bridge (each with Partitionable Endpoint State), Switch, Port, Adapter, IOV Adapter (PF, VF, VF, VF, VF), Partitionable Endpoints]
Managing Partitionable Endpoints
• Many I/O functions tracked using Partitionable Endpoints
– MMIO Load/Store address domains
– DMA I/O bus address domains and TCEs
– Interrupts
– Ordering of transactions per PE
– PE Error and Reset domains
• Enhanced RAS capabilities
– Enhanced I/O Error Handling (EEH)
– Partition isolation
[Figure: POWER8 with PCIe Host Bridges (PHB); labels: PCIe, Validate, Translate, Interrupts, Memory]
I/O DMA memory translation
• Translate I/O DMA memory addresses to system memory addresses
– Isolation of partitions
– Create contiguous view of non-contiguous data
– Enable 32-bit devices on 64-bit systems
• Processor chip contains cache of recently accessed tables
– Full tables stored in system memory
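The translation step can be pictured with a toy model. The Python sketch below is illustrative only: the 4 KiB page size, the dictionary-based table, and the names `TceTable`, `map_page`, and `translate` are assumptions made for this example, not the POWER8 table format.

```python
PAGE_SIZE = 4096  # assumed I/O page size for this sketch


class TceTable:
    """Toy per-partition table: I/O bus page number -> system memory page number."""

    def __init__(self):
        self.entries = {}

    def map_page(self, io_page, sys_page):
        self.entries[io_page] = sys_page

    def translate(self, io_addr):
        io_page, offset = divmod(io_addr, PAGE_SIZE)
        if io_page not in self.entries:
            # No entry: the device is reaching outside what the partition mapped.
            raise PermissionError(f"unmapped I/O address {io_addr:#x}")
        return self.entries[io_page] * PAGE_SIZE + offset


# Three scattered system pages appear as one contiguous I/O range.
table = TceTable()
for io_page, sys_page in enumerate([0x90000, 0x123456, 0x40001]):
    table.map_page(io_page, sys_page)

first = table.translate(0x0FFF)   # last byte of I/O page 0
second = table.translate(0x1000)  # first byte of I/O page 1
```

In this sketch the device only ever issues small I/O addresses, yet `second` resolves to a system address above 4 GiB, which is how a 32-bit device can reach memory in a 64-bit system.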
[Figure: translation flow: I/O bus DMA address → validate I/O bus DMA address → translate validated I/O bus DMA address → validate the translated I/O bus DMA address → validated and translated I/O bus DMA address]
Partitionable Endpoint Management
• DMA validation
– “Trusted” Requester ID (RID): 16-bit bus/device/function number
– “Untrusted” 64-bit address – received from device driver
• Interrupts tracked in system memory
• PEs impacted by I/O error identified via RID
– An SR-IOV Physical Function fault impacts the PEs of the physical function and of all virtual functions of the adapter
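A minimal sketch of RID-based validation follows. The split of the 16-bit RID into 8-bit bus, 5-bit device, and 3-bit function is the standard PCI layout; everything else (the `RID_TO_PE` and `PE_WINDOW` tables, the address windows, and freezing a PE on a bad address) is a hypothetical simplification for illustration, not the POWER8 implementation.

```python
def split_rid(rid):
    """Split a 16-bit PCIe Requester ID into (bus, device, function)."""
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("RID must fit in 16 bits")
    return rid >> 8, (rid >> 3) & 0x1F, rid & 0x7


# Hypothetical tables: RID -> PE#, and PE# -> permitted DMA window [start, end).
RID_TO_PE = {0x0100: 1, 0x0101: 2}
PE_WINDOW = {1: (0x00000000, 0x10000000), 2: (0x10000000, 0x20000000)}
FROZEN_PES = set()


def validate_dma(rid, addr):
    """Return the PE# if the trusted RID may use the untrusted address, else None."""
    pe = RID_TO_PE.get(rid)
    if pe is None or pe in FROZEN_PES:
        return None
    start, end = PE_WINDOW[pe]
    if not start <= addr < end:
        FROZEN_PES.add(pe)  # assumed policy: isolate only the offending PE
        return None
    return pe


ok = validate_dma(0x0100, 0x08000000)   # inside the window of PE 1
bad = validate_dma(0x0101, 0x08000000)  # outside the window of PE 2
```

The point of the sketch is that the lookup key is the RID supplied by the hardware, not anything the device driver provides, and that an error in one PE leaves the other PE usable.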
[Figure: three panels, each split into on-chip state and off-chip state (system memory). DMA Validation: DMA address and RID in, TCE cache, PE# for operation out. Interrupt Tracking: data and RID in, PE#, interrupt vectors and state, Interrupt Source Controller state machine. Error State Tracking: DMA address and RID in, PE# index, PE# vector array, error message state machine, PE# vector.]
Coherent Accelerator Processor Interface (CAPI)
• Integrated in every POWER8 system
• Builds on a long history of IBM workload acceleration
• Integrates with big-endian and little-endian accelerators
Microprocessor Trends
[Figure: Log(Performance) across successive design eras: Single Thread, Multi-Core, Hybrid, Special Purpose; recurring annotation: more active transistors, higher frequency]
Heterogeneous System Challenges
• The 4 ‘P’s of System Design
– Programmer Productivity
– Realize accelerator Performance benefits
– Portability: Investment protection for applications
– Partitioning for multi-user systems: processes, partitions
Workload-optimized acceleration with coherent accelerators
• Attached accelerators
– Accelerate functions that do not fit traditional CPU model
– Heterogeneous System Architecture
• Coherent integration in system architecture
– Data sharing
– Programming
– Performance
Application Acceleration
• Fine-grained data sharing
– coherent, shared memory
• Accelerator-initiated data accesses/transfers
– coherent, shared memory
• Pointer identity
– shared addressing
• Flexible synchronization
– symmetric, programmable interfaces
CAPI Acceleration overcomes Device Driver Deceleration
Typical I/O Model Flow: Total ~13µs for data prep
Flow with Coherent Model: Total ~0.36µs
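The two totals imply roughly a 36-fold reduction in overhead. The short Python calculation below only restates that arithmetic; the variable names are chosen for this example.

```python
typical_us = 13.0    # overhead of the typical I/O model, from the slide
coherent_us = 0.36   # overhead with the coherent model, from the slide

speedup = typical_us / coherent_us
saved_us = typical_us - coherent_us
print(f"overhead reduced about {speedup:.0f}x, saving {saved_us:.2f} us")
```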
Workload-optimized acceleration
• On-chip integrated accelerators (SoC design)
– Compute accelerator (Cell BE)
– Compression (P7+)
– Encryption (P7+)
– Random number generation (P7+)
– …
• SoC design offers highest integration, but…
– Requires new chip design for accelerator
– Long time to market
– Requires very high volumes
[Figure labels: Cell BE, POWER7+]
CAPI: Coherent Accelerator Processor Interface
• Integrate accelerators into system arch.
– Modular interface
– Third-party high value-add components
• Standardized, layered protocol
– architectural interface
– functional protocol
– PCIe signaling protocol
• Create workload-optimized innovative solutions
– Faster time to market
– Lower bar to entry
– Variety of implementation options: FPGAs, ASICs
[Figure: POWER8 with Coherence Bus and proxy, attached to the PSL*]
* PSL: Power Service Layer
CAPI accelerator programming
[Figure: Coherence Bus, proxy, PSL]
Virtual Addressing
• Pointer identity: pointers reference the same object as in the host application
• CAPI accelerators work with the same virtual memory addresses as the CPU
• CAPI shares page tables and provides address translation of the host application
• Peer-to-peer programming between CPU and accelerator with in-memory data sharing
Virtualization and Partitioning
• Address translation supports process isolation
– Accelerator has access to application context (only)
• Address translation supports partition isolation
– Accelerator has access to partition data (only)
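Pointer identity and context isolation can be illustrated with a toy model. In the Python sketch below, `AddressSpace` and `Accelerator` are hypothetical classes invented for this example: a dictionary stands in for the shared page tables, and the accelerator is bound to a single application context.

```python
class AddressSpace:
    """Toy process address space: virtual address -> stored object."""

    def __init__(self, name):
        self.name = name
        self.memory = {}

    def store(self, vaddr, value):
        self.memory[vaddr] = value

    def load(self, vaddr):
        return self.memory[vaddr]


class Accelerator:
    """Toy accelerator bound to exactly one application context."""

    def __init__(self, context):
        self._context = context  # the only address space it can translate

    def sum_list(self, head_vaddr):
        # Walk a linked list using the host's own pointers: no copy, no re-mapping.
        total, vaddr = 0, head_vaddr
        while vaddr is not None:
            value, vaddr = self._context.load(vaddr)
            total += value
        return total


app = AddressSpace("app")
app.store(0x1000, (5, 0x2000))   # node = (value, pointer to next node)
app.store(0x2000, (7, 0x3000))
app.store(0x3000, (30, None))

other = AddressSpace("other")
other.store(0x1000, (999, None))  # same virtual address, different process

result = Accelerator(app).sum_list(0x1000)
```

The accelerator follows the same pointers the host application stored, and the data of the other process at the same virtual address is never visible to it.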
CAPI accelerator programming
[Figure: Coherence Bus, proxy, PSL]
Hardware Managed Cache Coherence
• No need for memory pinning
• Data fetched by accelerator based on accelerator application flow
• Accelerator participates in locks
• Low latency communication
CAPI Accelerator Virtualization
• Dedicated Model
– Accelerator assigned to single process in a partition
– Binding via Operating System (process) and Hypervisor (partition)
• Time-quantum shared programming model
– Protocol-controlled model
• Accelerator-directed shared programming model
– networking model (select context based on incoming data)
[Figure: Coherence Bus, proxy, PSL, work queues]
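The two shared models differ in who picks the next context. The Python sketch below is a loose illustration under stated assumptions: the function names, the per-context work queues as `deque` objects, and the round-robin policy are all invented for this example and do not describe the CAPI protocol.

```python
from collections import deque
from itertools import cycle, islice


def time_quantum_order(contexts, slots):
    """Time-quantum sharing: grant the accelerator to each context in turn."""
    return list(islice(cycle(contexts), slots))


def accelerator_directed(packets, queues):
    """Accelerator-directed sharing: pick the context from the incoming data."""
    for context_id, payload in packets:
        queues[context_id].append(payload)
    return queues


queues = {"ctx_a": deque(), "ctx_b": deque()}
accelerator_directed([("ctx_a", 1), ("ctx_b", 2), ("ctx_a", 3)], queues)
order = time_quantum_order(["ctx_a", "ctx_b"], 5)
```

In the dedicated model neither function is needed, since a single process owns the accelerator.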
Coherent acceleration of data sharing
• Offload data synchronization and data transfers
– No need to invalidate data before initiating I/O transfers
– Accelerator feeds itself, avoiding use of a high-function CPU as data mover
• Available translation, transfer and synchronization bandwidth scales with parallelism
– As more accelerators are used, available resources scale up
– Avoid CPU becoming serial bottleneck
[Figure: normalized execution time (0 to 9) for 1 to 16 SPEs, split into compute and address management; annotation: cost avoided by system-wide page tables]
[Figure: normalized time (0 to 0.4) for 1, 2, 4, 8 and 16 SPEs, split into compute and cache mgmt; annotation: cost avoided by cache-coherent transfers]
[Gschwind, ICCD 2008]
Coherent accelerator programming flexibility
• Garbage collection accelerator
– Explore boundaries of acceleration
– Study accelerator programmability
• Pointer identity: advanced data structures
– Autonomous traversal of data
• Self-paced memory access
– Simplifies data management
– No callbacks to request data
– Zero-copy data access
• Handle complex access patterns
[Figure: normalized area×delay product] [Cher & Gschwind, VEE 2008]
CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs to eliminate 97% of path length
• Saves 20-30 cores per 1M IOPS
– 20K instructions reduced to 500
[Figure: two I/O paths. Conventional: Application → Read/Write Syscall → File System → LVM → Disk and Adapter DD, with strategy() (pin buffers, translate, map DMA, start I/O) on the way down and iodone() (interrupt, unmap, unpin, iodone scheduling) on the way back. CAPI: Application → Library with POSIX Async I/O style API (aio_read(), aio_write()) → Shared Memory Work Queue.]
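The 97% figure is consistent with the instruction counts on the slide, as this short Python check shows. The variable names are chosen for this example.

```python
conventional_instructions = 20_000  # per I/O, from the slide
capi_instructions = 500             # per I/O, from the slide

eliminated = 1 - capi_instructions / conventional_instructions
print(f"path length eliminated: {eliminated:.1%}")
```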
CAPI Unlocks the Next Level of Performance for Flash
Identical hardware (POWER S822L with FlashSystem) with 2 different paths to data: Conventional I/O and CAPI
• >5x better IOPS per HW thread
• >2x lower latency
[Figure: IOPS per Hardware Thread (axis 0 to 120,000) and Latency in µs (axis 0 to 500), Conventional I/O vs. CAPI]
What CAPI Means for NoSQL Solutions
Today’s NoSQL in memory (x86)
Infrastructure Requirements
- Large distributed (Scale out)
- Large memory per node
- Networking bandwidth needs
- Load balancing
[Figure labels: WWW, 10Gb Uplink, Load Balancer, 500GB Cache Nodes, Backup Nodes]
Differentiated NoSQL (POWER8 + CAPI + FlashSystem)
Infrastructure Attributes
- 192 threads in 4U server drawer
- 56 TB of flash per 2U drawer
- Shared Memory & cache for dynamic tuning
- Elimination of I/O and network overhead
- Cluster solution in a box
[Figure labels: WWW, 10Gb Uplink, POWER8 Server, Flash Array w/ up to 56TB]
Power CAPI-attached FlashSystem for NoSQL regains infrastructure control and reins in the cost to deliver services.
Summary
• POWER8 delivers advanced virtualization for CPU and I/O
– High-performance I/O virtualization
– Data isolation
– Fault isolation
• POWER8 takes the next step in exploiting system accelerators
– Eliminate overheads inherent in I/O model
– Reduced latency
– Increased throughput
• Collaborative Innovation based on Open Standards
– Both specifications donated to OpenPOWER Foundation by IBM
– Available royalty-free to all OpenPOWER members