Developing Applications For More Than 64 Logical Processors In Windows Server 2008 R2
Agenda
Terminology
Announcing Support for Greater than 64 Logical Processors
Hardware Evolution
Application Guidelines
Application Compatibility
New Structures and APIs for >64LP
Non-Uniform Memory Architecture (NUMA)
Optimizing for Topology
System Tools
Intel Enterprise Roadmap
The Future of Scale Up
The Future Of Servers
Processors can’t continue to get faster and hotter… but Moore’s Law still rules.
The big iron of today is the commodity server of tomorrow.
How does Windows play? Bet on scale up? Bet on virtualization and consolidation? The answer: all of the above.
Scale Up Terminology
Processor / Package / Socket: One physical processor; may consist of one or more cores.
Core: One processing unit; may consist of one or more logical processors.
Logical Processor (LP) / Hardware Thread: One logical computing engine in the OS, application, and driver view.
NUMA: Non-Uniform Memory Architecture.
NUMA Node: Group of logical processors, cache, and memory that are in proximity to each other; I/O is an optional part of a NUMA node.
Group (new): Logical grouping of up to 64 logical processors.
[Diagram: a 128 logical processor system decomposed into groups (up to 64 logical processors each), NUMA nodes, sockets, cores, and logical processors]
Windows Server 2008 R2
Announcing support for up to 256 logical processors.
Targeting environments with large single-image line-of-business and database applications: SQL Server, SAP, etc.
Refactoring of hot locks in the Windows kernel.
Driver stack enlightenment.
New APIs for >64LP and locality.
Tools and UI for viewing hundreds of logical processors and locality.
OLTP Scaling 64LP To 128LP
[Chart: OLTP throughput on a 128LP system is 1.7x that of the same workload on 64LP]
Hardware Evolution
The industry is moving towards segmented processing complexes:
CPUs share internal caches between cores.
CPU and memory form a “node”.
“Close” versus “far” hardware can be determined easily.
Additional Scale Up Features
Windows Hardware Error Architecture (WHEA)
Server power management: core parking
Virtualization and hypervisor
Greater Than 64LP Support
Segmented specification: “groups” of CPUs.
CPUs identified in software by Group# : CPU#.
Allows backward compatibility with 64-bit affinity masks.
New applications have the full CPU range using new APIs.
Permits better locality during scheduling than a “flat” specification.
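To make the Group# : CPU# scheme concrete, here is a minimal sketch (not from the deck) using GetCurrentProcessorNumberEx, which reports the calling thread's CPU as a group-relative PROCESSOR_NUMBER:

    /* Minimal sketch: print the caller's CPU as Group:Number.
       Requires Windows 7 / Windows Server 2008 R2 (_WIN32_WINNT >= 0x0601). */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        PROCESSOR_NUMBER pn;
        GetCurrentProcessorNumberEx(&pn);   /* group-relative CPU identity */
        printf("Running on group %u, processor %u\n", pn.Group, pn.Number);
        return 0;
    }

On a single-group system the group number is always 0, which is how legacy flat CPU numbers remain meaningful.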
Engagements
Today: focus on server and single LOB applications (Intel, AMD, HP, Unisys, IBM, NEC, Fujitsu; SQL Server, SAP; NICs; storage).
Tomorrow: focus expands to client and the larger driver and application ecosystem.
HP Superdome HW Configuration (IA64 example: 256 logical processors)
64 dual-core hyper-threaded “Montvale” 9100 1.6 GHz Itanium2 w/ 24 MB LLC
1 TB memory, 4 dual-port 1 Gb NICs
30 HP SmartArray P600 HBAs (x4 3Gb SAS)
60 HP MSA70 disk arrays, 1440 72 GB 2.5” 15K rpm disks
A 64-core configuration can achieve 200,000 IOps for 8-64 KB requests.
Unisys ES7000 HW Configuration (x64 example: 128 logical processors)
32 dual-core hyper-threaded “Tulsa” 7100 3.4 GHz Xeon w/ 16 MB LLC
256 GB memory
24 Emulex LP10002 dual-port FC HBAs
48 HP MSA1000/1500 disk arrays, 1540 36 GB 3.5” 15K rpm disks
Perf Benchmark Run On 128P System
[Video: performance benchmark demonstration on a 128-processor system]
Application Compatibility
Legacy applications see only a single group.
Applications are spread across groups via round-robin.
Applications only need changes if:
They manage per-processor information for the whole system (such as Task Manager), or
They are performance critical and need to scale beyond 64 logical processors.
Applications that do use processor affinity masks or numbers will still operate correctly.
There is zero app-compat impact on systems with 64 or fewer logical processors.
New Structures
Group-relative processor affinity and processor number:

    typedef struct _GROUP_AFFINITY {
        KAFFINITY Mask;
        USHORT    Group;
        USHORT    Reserved[3];
    } GROUP_AFFINITY, *PGROUP_AFFINITY;

    typedef struct _PROCESSOR_NUMBER {
        USHORT Group;
        UCHAR  Number;
        UCHAR  Reserved;
    } PROCESSOR_NUMBER, *PPROCESSOR_NUMBER;
Thread And Process Affinity On >64LP Systems
By default, processes are assigned in a round-robin fashion across groups.
An application can override this default using the process affinity APIs.
A process starts in a single group but can expand to contain threads running on all groups in a machine.
A single thread can never be assigned to more than one group at any time.
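As an illustration of these rules, here is a minimal sketch (assumed usage, not from the deck) that reads the current thread's group with GetThreadGroupAffinity and then moves it to group 1; the target group is an arbitrary example, and the sketch assumes a multi-group machine:

    /* Sketch: move the current thread to processor group 1 (if present).
       A thread belongs to exactly one group at a time, so this replaces,
       rather than extends, its previous affinity. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        GROUP_AFFINITY current, target, previous;

        if (!GetThreadGroupAffinity(GetCurrentThread(), &current))
            return 1;
        printf("Starting group: %u\n", current.Group);

        if (GetActiveProcessorGroupCount() < 2)
            return 0;                       /* single-group machine */

        ZeroMemory(&target, sizeof(target));
        target.Group = 1;                   /* arbitrary example group */
        DWORD lps = GetActiveProcessorCount(target.Group);
        target.Mask = (lps >= 64) ? ~(KAFFINITY)0
                                  : (((KAFFINITY)1 << lps) - 1);

        if (SetThreadGroupAffinity(GetCurrentThread(), &target, &previous))
            printf("Moved from group %u to group %u\n",
                   previous.Group, target.Group);
        return 0;
    }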
Affinitized Applications
Applications are constrained to a single group unless they explicitly create threads on other groups; most applications benefit from locality of resources.
Applications need to use the extended APIs to affinitize beyond one processor group; performance is the motivation.
Applications must understand what work can run independently of other work and assign it to other groups. Beware: on NUMA systems this can hinder, not help, performance.
Some New Functions
GetActiveProcessorGroupCount: Returns the number of active groups in the system.
GetMaximumProcessorGroupCount: Returns the maximum number of groups that the system supports.
GetActiveProcessorCount: Returns the number of active LPs in a group or in the system.
GetMaximumProcessorCount: Returns the maximum number of LPs that a group or the system can support.
GetThreadGroupAffinity: Returns the current group affinity of the thread (as a GROUP_AFFINITY).
SetThreadGroupAffinity: Sets the affinity of the thread to a set of LPs within a specified group (as a GROUP_AFFINITY).
CreateRemoteThreadEx: Enables an application to change the default thread group affinity and specify an ideal LP for a thread.
GetNumaNodeNumberFromHandle: Returns the NUMA node number associated with a handle (e.g., file and socket handles).
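Combining several of these functions (a hedged sketch, not from the deck; the Worker routine and the choice of group 1 are illustrative, and the attribute-list plumbing is standard Win32 rather than something shown on the slide), an application can start a thread directly in another group by passing a GROUP_AFFINITY through the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY attribute to CreateRemoteThreadEx:

    /* Sketch: create a worker thread that starts life in processor group 1. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    static DWORD WINAPI Worker(LPVOID arg)     /* illustrative worker */
    {
        PROCESSOR_NUMBER pn;
        (void)arg;
        GetCurrentProcessorNumberEx(&pn);
        printf("Worker on group %u, processor %u\n", pn.Group, pn.Number);
        return 0;
    }

    int main(void)
    {
        if (GetActiveProcessorGroupCount() < 2)
            return 0;                          /* needs a multi-group machine */

        /* Size, allocate, and initialize a one-entry attribute list. */
        SIZE_T size = 0;
        InitializeProcThreadAttributeList(NULL, 1, 0, &size);
        LPPROC_THREAD_ATTRIBUTE_LIST attrs = malloc(size);
        if (!attrs || !InitializeProcThreadAttributeList(attrs, 1, 0, &size))
            return 1;

        GROUP_AFFINITY ga;
        ZeroMemory(&ga, sizeof(ga));
        ga.Group = 1;                          /* example target group */
        DWORD lps = GetActiveProcessorCount(ga.Group);
        ga.Mask = (lps >= 64) ? ~(KAFFINITY)0 : (((KAFFINITY)1 << lps) - 1);
        UpdateProcThreadAttribute(attrs, 0, PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
                                  &ga, sizeof(ga), NULL, NULL);

        HANDLE h = CreateRemoteThreadEx(GetCurrentProcess(), NULL, 0,
                                        Worker, NULL, 0, attrs, NULL);
        if (h) { WaitForSingleObject(h, INFINITE); CloseHandle(h); }

        DeleteProcThreadAttributeList(attrs);
        free(attrs);
        return 0;
    }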
NUMA Nodes (Non-Uniform Memory Architecture)
Apps scaling beyond 64 logical processors should be NUMA aware; future server designs will be NUMA.
A process or thread can set a preferred NUMA node.
This helps with group assignments, as nodes must not cross groups.
It also provides a performance optimization inside a group.
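A minimal sketch of node-aware placement (assumed usage, not from the deck): query a node's processors with GetNumaNodeProcessorMaskEx, affinitize the thread to that node, and allocate a buffer whose physical pages prefer that node with VirtualAllocExNuma. Node 0 is an arbitrary example:

    /* Sketch: run on NUMA node 0 and allocate memory homed to that node. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highestNode;
        if (!GetNumaHighestNodeNumber(&highestNode))
            return 1;
        printf("Highest NUMA node: %lu\n", highestNode);

        /* Processors of node 0, expressed as a group-relative affinity.
           Nodes never span groups, so one GROUP_AFFINITY suffices. */
        GROUP_AFFINITY nodeAffinity;
        if (!GetNumaNodeProcessorMaskEx(0, &nodeAffinity))
            return 1;
        SetThreadGroupAffinity(GetCurrentThread(), &nodeAffinity, NULL);

        /* 1 MB whose physical pages are preferentially taken from node 0. */
        void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, 1 << 20,
                                       MEM_RESERVE | MEM_COMMIT,
                                       PAGE_READWRITE, 0 /* preferred node */);
        if (!buf)
            return 1;
        /* ... touch and use buf with node-local access ... */
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }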
What Is Non-Uniform Memory Access (NUMA) I/O?
A picture is worth a thousand words!

Bad Case: Disk Write, Unoptimized Driver
[Diagram: the I/O initiator, ISR, and DPC run on a node far from the disk and the I/O buffer’s home memory, so each step of the write crosses the node interconnect, and other processors are locked out during I/O initiation]
[Diagram: the contrasting optimized-driver case, where the I/O initiator, ISR, and DPC all run on the node close to the disk and the I/O buffer’s home memory, avoiding unnecessary trips across the node interconnect]
Optimization For Topology: Using NUMA APIs
System Topology APIs
Processor topology exposed (user/kernel): NUMA nodes, sockets, cores, logical processors (e.g., hyper-threads); includes details of CPU caches (L1/L2/L3).
Memory topology exposed.
Device location exposed (e.g., network and storage controllers); determine associated NUMA node(s) via ACPI Proximity Domains (BIOS support required).
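For example (assumed usage, not from the deck), the NUMA-node portion of the topology can be enumerated with GetLogicalProcessorInformationEx; this sketch walks the variable-sized records and prints each node's group-relative processor mask:

    /* Sketch: enumerate NUMA nodes via the extended topology API. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* First call with a NULL buffer to learn the required size. */
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationNumaNode, NULL, &len);
        if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
            return 1;

        BYTE *buf = malloc(len);
        if (!buf || !GetLogicalProcessorInformationEx(RelationNumaNode,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
            return 1;

        /* Records are variable-sized; advance by each record's Size field. */
        for (DWORD off = 0; off < len; ) {
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buf + off);
            printf("NUMA node %lu: group %u, mask 0x%llx\n",
                   info->NumaNode.NodeNumber,
                   info->NumaNode.GroupMask.Group,
                   (unsigned long long)info->NumaNode.GroupMask.Mask);
            off += info->Size;
        }
        free(buf);
        return 0;
    }

The same API called with RelationProcessorCore, RelationProcessorPackage, RelationCache, or RelationGroup returns the other levels of the hierarchy listed above.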
System Tools
Performance Monitor: processor-to-node relationship information; total system logical processor utilization.
Task Manager: average CPU utilization per LP; average CPU utilization for a given node.
Resource Monitor
Intel® Enterprise Roadmap
All dates, product features and plans are subject to change without notice.

Segment | 2008 | Future
Intel® Xeon® MP 7000 Sequence (Expandable, MC) | Quad/Dual-core Xeon Processor, Intel® 7300 Chipset; Dunnington | Nehalem Processor, Future EX Chipset
Intel® Xeon® DP 5000 Sequence (Efficient Performance) | Quad/Dual-core Xeon Processor, Intel® 5000 / 5100 Chipset | Nehalem Processor, Future EP Chipset
Intel® Xeon® UP 3000 Sequence (Entry) | Quad/Dual-core Xeon Processor, Intel® 3200 Chipset | Nehalem Processor, Future EN Chipset
Intel® Xeon® WS 5000 Sequence (Workstation & HPC) | Quad/Dual-core Xeon Processor, Intel® 5400 Chipset | Nehalem Processor, Future WS Chipset
Intel® Itanium® 9000 Sequence (Mission Critical) | Itanium Processor, 870 / OEM Chipset | Tukwila, Poulson, Kittson; Future MC Chipset
Nehalem Based System Architecture
[Diagram: two-socket and four-socket Nehalem systems linked by Intel QuickPath Interconnect, with PCI Express* I/O hubs and an ICH attached via DMI]
2, 4, or 8 cores per socket; two logical processors per core.
Expect large Nehalem-EX systems with 128-256 logical processors.
Intel® QuickPath Architecture; integrated memory controller; buffered or un-buffered memory.
*Optional integrated graphics.
Nehalem Family
Server and workstation: Nehalem-EX, Expandable (4S+); Nehalem-EP, Efficient Performance (2S)
Business and consumer clients: Lynnfield (high-end desktop), Havendale (mainstream client), Clarksfield and Auburndale (thin-and-light notebook)
Interrupt Addressing
The number of processors is increasing, straining interrupt addressability limits.
xAPIC has met Intel Architecture interrupt needs since the Pentium 4:
8-bit APIC IDs; physical addressing limited to 0-254 APIC IDs
Logical addressing limited to 60 processors
Flat addressing limited to 8 processors
Nehalem processors require 16 APIC IDs per socket for 8 processors.
Interrupt addressing is therefore strained on Nehalem platforms: xAPIC can address at most an 8-socket Nehalem system, only 128 processors.
The next-generation interrupt architecture allows scaling to 256 processors on Nehalem platforms.
x2APIC
x2APIC is Intel’s next-generation interrupt architecture, designed to meet interrupt address scaling needs for the future:
A 32-bit APIC ID enables systems with a very large number of processors (4G)
Compatible with existing PCI-E/PCI devices and IOxAPICs
Requires interrupt remapping support in platform HW and BIOS (Intel® Virtualization Technology for Directed I/O)
ACPI 4.0 augmented to enumerate x2APIC
OS/VMM support needed for x2APIC and interrupt remapping; minimal SW impact otherwise
Scale Up Futures
Server and client
Virtualization and hypervisor continue to scale
Continued power management innovations
WHEA for client?
ManyCore: hundreds of cores
Scale Up And Visual Studio 2010: Resource Management And Scheduling
Support for parallel programming paradigms in Visual Studio 2010
Resource management and scheduling for concurrent workloads (Microsoft Concurrency Runtime)
Programming models, libraries, and tools for managed and native developers
More information: http://msdn.microsoft.com/concurrency
Enabling Concurrency Runtimes
Reducing kernel intervention in thread scheduling using User-Mode Scheduling (UMS).

[Diagram: with conventional scheduling, the kernel multiplexes threads 1-6 across two cores; with UMS, each user thread is paired with a kernel thread and user-mode code switches among user threads on each core without kernel involvement]
Call To Action
Leverage the topology APIs to build scalable applications and drivers.
Enlighten applications and tools to monitor system performance on >64LP machines.
Implement support for x2APIC on >128 logical processor Nehalem-EX platforms.
Reduce support and maintenance costs by scaling up on Windows Server 2008 R2.
Questions? Email [email protected]
Resources
Whitepaper on how to develop applications and drivers for >64P support <coming soon>: http://go.microsoft.com/fwlink/?LinkId=131250
Windows Server Performance Team blog: http://blogs.technet.com/winserverperformance/
Concurrency Runtime programming models, libraries, and tools: http://msdn.microsoft.com/concurrency
Intel® 64 Architecture x2APIC Specification: http://download.intel.com/design/processor/specupdt/318148.pdf
Intel® Virtualization Technology for Directed I/O: http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
Evals & Recordings
Please fill out your evaluation for this session.
This session will be available as a recording at: www.microsoftpdc.com
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.