
Developing Applications For More Than 64 Logical Processors In Windows Server 2008 R2

Agenda

Terminology
Announcing Support for Greater than 64 Logical Processors
Hardware Evolution
Application Guidelines
Application Compatibility
New Structures and APIs for >64LP
Non-Uniform Memory Architecture (NUMA)
Optimizing for Topology
System Tools
Intel Enterprise Roadmap
The Future of Scale Up

The Future Of Servers

Processors can't continue to get faster and hotter… but Moore's Law still rules
The big iron of today is the commodity server of tomorrow
How does Windows play? Bet on scale up? Bet on virtualization and consolidation? Answer: all of the above

Scale Up Terminology

Processor / Package / Socket: One physical processor – may consist of one or more cores
Core: One processing unit – may consist of one or more logical processors
Logical Processor (LP) / Hardware Thread: One logical computing engine in the OS, application, and driver view
NUMA: Non-Uniform Memory Architecture
NUMA Node: Group of logical processors, cache, and memory that are in proximity to each other; I/O is an optional part of a NUMA node
Group <new>: Logical grouping of up to 64 logical processors


[Diagram: a 128 logical processor system decomposed into Groups (up to 64 logical processors each), NUMA Nodes, Sockets, Cores, and Logical Processors]

Windows Server 2008 R2

Announcing support for up to 256 logical processors
Targeting environments with large single-image line-of-business and database applications: SQL Server, SAP, etc.
Refactoring of hot locks in the Windows kernel
Driver stack enlightenment
New APIs for >64LP and locality
Tools and UI for viewing hundreds of logical processors and locality

OLTP Scaling 64LP To 128LP

[Chart: OLTP throughput on the 128LP configuration is 1.7X that of the same workload on 64LP]

Hardware Evolution

Industry is moving towards segmented processing complexes
CPUs share internal caches between cores
CPU and memory form a "node"
"Close" versus "far" hardware can be determined easily

Additional Scale Up Features

Windows Hardware Error Architecture (WHEA)
Server Power Management: Core Parking
Virtualization and Hypervisor

Greater Than 64LP Support

Segmented specification – "groups" of CPUs
CPUs identified in software by Group# : CPU#
Allows backward compatibility with 64-bit affinity
New applications have full CPU range using new APIs (see the sketch below)
Permits better locality during scheduling than a "flat" specification

Engagements

Today – Focus on Server and single LOB applications: Intel, AMD, HP, Unisys, IBM, NEC, Fujitsu; SQL Server, SAP; NICs; storage
Tomorrow – Focus expands to client and the larger driver and application ecosystem

HP Superdome HW Configuration – IA64 Example (256 Logical Processors)

64 dual-core hyper-threaded "Montvale" 9100 1.6 GHz Itanium2 w/ 24 MB LLC
1 TB memory, 4 dual-port 1 Gb NICs
30 HP SmartArray P600 HBAs (x4 3Gb SAS)
60 HP MSA70 disk arrays, 1440 72 GB 2.5" 15K rpm disks
A 64-core configuration can achieve 200,000 IOps for 8-64 KB requests

Unisys ES7000 HW Configuration – x64 Example (128 Logical Processors)

32 dual-core hyper-threaded "Tulsa" 7100 3.4 GHz Xeon w/ 16 MB LLC
256 GB memory
24 Emulex LP10002 dual-port FC HBAs
48 HP MSA1000/1500 disk arrays, 1540 36 GB 3.5" 15K rpm disks

Perf Benchmark Run On 128P System

[Video: performance benchmark demonstration on the 128-processor system]

Application Compatibility

Legacy applications see only a single group
Applications are spread across groups via round-robin
Applications only need changes if:
  They manage per-processor information for the whole system, such as Task Manager
  They are performance critical and need to scale beyond 64 logical processors
Applications that do use processor affinity masks or numbers will still operate correctly
There is zero app-compat impact on systems with fewer than 64 logical processors

New Structures

Group-relative processor affinity and processor number:

typedef struct _GROUP_AFFINITY {
    KAFFINITY Mask;
    USHORT    Group;
    USHORT    Reserved[3];
} GROUP_AFFINITY, *PGROUP_AFFINITY;

typedef struct _PROCESSOR_NUMBER {
    USHORT Group;
    UCHAR  Number;
    UCHAR  Reserved;
} PROCESSOR_NUMBER, *PPROCESSOR_NUMBER;

Thread And Process Affinity On >64LP Systems

By default, processes are assigned in a round-robin fashion across groups
An application can override this default using the process affinity APIs (sketch below)
A process starts in a single group but can expand to contain threads running on all groups in a machine
A single thread can never be assigned to more than one group at any time
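A minimal sketch of moving a thread between groups with SetThreadGroupAffinity and the GROUP_AFFINITY structure shown above, assuming a machine with more than one active group and the Windows 7 / Server 2008 R2 SDK:

// Hedged sketch: pin the current thread to every active LP of group 1.
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

int main(void)
{
    if (GetActiveProcessorGroupCount() > 1) {
        DWORD lps = GetActiveProcessorCount(1);   // active LPs in group 1
        GROUP_AFFINITY affinity, previous;
        ZeroMemory(&affinity, sizeof(affinity));
        affinity.Group = 1;
        // Mask covering every active LP in the group; avoid the undefined
        // 64-bit shift when the group is full.
        affinity.Mask = (lps >= sizeof(KAFFINITY) * 8)
                            ? (KAFFINITY)-1
                            : (((KAFFINITY)1 << lps) - 1);
        if (SetThreadGroupAffinity(GetCurrentThread(), &affinity, &previous))
            printf("Moved from group %u to group 1\n", previous.Group);
    }
    return 0;
}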

Affinitized Applications

Applications are constrained to a single group unless they explicitly create threads on other groups
Most applications benefit from locality of resources
Applications need to use the extended APIs to affinitize beyond one processor group; performance is the motivation
Applications must understand what work can run independently of other work and assign it to other groups
Beware: on NUMA systems this can hinder, not help, performance

Some New Functions

GetActiveProcessorGroupCount – Returns the number of active groups in the system
GetMaximumProcessorGroupCount – Returns the maximum number of groups that the system supports
GetActiveProcessorCount – Returns the number of active LPs in a group or in the system
GetMaximumProcessorCount – Returns the maximum number of LPs that a group or the system can support
GetThreadGroupAffinity – Returns the current group affinity of the thread (in a GROUP_AFFINITY)
SetThreadGroupAffinity – Sets the affinity of the thread to a set of LPs within a specified group (in a GROUP_AFFINITY)
CreateRemoteThreadEx – Enables an application to change the default thread group affinity and specify an ideal LP for a thread (see the sketch below)
GetNumaNodeNumberFromHandle – Returns the node number associated with a handle (e.g., file and socket handles)
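A hedged sketch of CreateRemoteThreadEx, assuming a >64LP machine that actually exposes a group 1. The thread is created in the caller's own process, and its group affinity is passed through a PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY attribute; error handling is trimmed:

// Hedged sketch: start a worker thread directly on group 1, LP 0.
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI Worker(LPVOID arg)
{
    PROCESSOR_NUMBER pn;
    GetCurrentProcessorNumberEx(&pn);      // group-relative LP we landed on
    printf("Worker on group %u, LP %u\n", pn.Group, pn.Number);
    return 0;
}

int main(void)
{
    GROUP_AFFINITY affinity;
    ZeroMemory(&affinity, sizeof(affinity));
    affinity.Group = 1;                    // assumes a second group exists
    affinity.Mask  = 0x1;                  // LP 0 of that group

    SIZE_T size = 0;
    InitializeProcThreadAttributeList(NULL, 1, 0, &size);  // query size
    LPPROC_THREAD_ATTRIBUTE_LIST attrs =
        (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), 0, size);
    InitializeProcThreadAttributeList(attrs, 1, 0, &size);
    UpdateProcThreadAttribute(attrs, 0, PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
                              &affinity, sizeof(affinity), NULL, NULL);

    HANDLE h = CreateRemoteThreadEx(GetCurrentProcess(), NULL, 0, Worker,
                                    NULL, 0, attrs, NULL);
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    DeleteProcThreadAttributeList(attrs);
    HeapFree(GetProcessHeap(), 0, attrs);
    return 0;
}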

NUMA Nodes – Non-Uniform Memory Architecture

Apps scaling beyond 64 logical processors should be NUMA aware
Future server designs will be NUMA
A process or thread can set a preferred NUMA node (see the sketch below)
This helps with group assignments, as nodes must not cross groups
It also provides a performance optimization inside a group
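A minimal sketch of the node-aware APIs, assuming Windows 7 / Server 2008 R2: list each node's group-relative processor mask, then commit memory with a preferred node:

// Hedged sketch: query NUMA topology and prefer node-local memory.
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;
    GetNumaHighestNodeNumber(&highest);

    for (USHORT node = 0; node <= (USHORT)highest; node++) {
        GROUP_AFFINITY mask;
        if (GetNumaNodeProcessorMaskEx(node, &mask))
            printf("Node %u: group %u, LP mask 0x%llx\n", node, mask.Group,
                   (unsigned long long)mask.Mask);
    }

    // Commit 1 MB preferring node 0's local memory.
    void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, 1 << 20,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 0);
    if (buf)
        VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}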

What Is Non-Uniform Memory Access (NUMA) I/O?

A picture is worth a thousand words!

[Diagram: bad case – a disk write through an unoptimized driver. The I/O initiator, the I/O buffer's home memory, the ISR, and the DPC are spread across both NUMA nodes, so the transfer repeatedly crosses the node interconnect and processors are locked out during I/O initiation.]

[Diagram: optimized case – using the NUMA APIs, the ISR and DPC run on the same node as the I/O initiator and buffer, so the write completes in a few node-local steps.]

Using NUMA APIs – Optimization for Topology

System Topology APIs

Processor topology exposed (user/kernel): NUMA nodes, sockets, cores, logical processors (e.g., hyper-threads); see the sketch below
Includes details of CPU caches (L1/L2/L3)
Memory topology exposed
Device location exposed (e.g., network and storage controllers)
Determine associated NUMA node(s) via ACPI Proximity Domains (BIOS support required)
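A hedged sketch of walking that topology from user mode with GetLogicalProcessorInformationEx, assuming the Windows 7 SDK; only NUMA node and cache records are printed, and error handling is trimmed:

// Hedged sketch: enumerate processor topology records.
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationAll, NULL, &len);  // query size
    BYTE *buffer = (BYTE *)malloc(len);
    if (!GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len))
        return 1;

    for (DWORD offset = 0; offset < len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buffer + offset);
        switch (info->Relationship) {
        case RelationNumaNode:
            printf("NUMA node %lu in group %u\n",
                   info->NumaNode.NodeNumber, info->NumaNode.GroupMask.Group);
            break;
        case RelationCache:
            printf("L%u cache, %lu KB\n",
                   info->Cache.Level, info->Cache.CacheSize / 1024);
            break;
        default:
            break;
        }
        offset += info->Size;   // records are variable-length
    }
    free(buffer);
    return 0;
}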

System Tools

Performance Monitor: processor-to-node relationship information; total system logical processor utilization
Task Manager: average CPU utilization per LP; average CPU utilization for a given node
Resource Monitor

Intel® Enterprise Roadmap

All dates, product features and plans are subject to change without notice.

[Roadmap diagram, 2008 → Future:]
Intel® Xeon® MP 7000 Sequence (Expandable, MC): Quad/Dual-core Xeon Processor + Intel® 7300 Chipset, then Dunnington → Nehalem Processor + Future EX Chipset
Intel® Xeon® DP 5000 Sequence (Efficient Performance): Quad/Dual-core Xeon Processor + Intel® 5000/5100 Chipsets → Nehalem Processor + Future EP Chipset
Intel® Xeon® UP 3000 Sequence (Entry): Quad/Dual-core Xeon Processor + Intel® 3200 Chipset → Nehalem Processor + Future EN Chipset
Intel® Xeon® WS 5000 Sequence (Workstation & HPC): Quad/Dual-core Xeon Processor + Intel® 5400 Chipset → Nehalem Processor + Future WS Chipset
Intel® Itanium® 9000 Sequence (Mission Critical): Itanium Processor + 870/OEM Chipset → Tukwila, Poulson, Kittson + Future MC Chipset

Nehalem Based System Architecture

[Diagram: a two-socket Nehalem system – the processors connect to an I/O Hub, PCI Express attaches to the I/O Hub, and an ICH hangs off DMI]

2, 4, or 8 cores per socket, two logical processors per core
Expect large Nehalem-EX systems with 128-256 logical processors
Intel® QuickPath Architecture
Integrated memory controller
Buffered or un-buffered memory
*Optional integrated graphics
Intel QuickPath Interconnect

[Diagram: a four-socket Nehalem configuration with two I/O Hubs]

Nehalem Family

[Diagram: the Nehalem family – Nehalem-EX (Expandable, 4S+) and Nehalem-EP (Efficient Performance, 2S) for server and workstation; Lynnfield, Clarksfield, Havendale, and Auburndale spanning high-end desktop, mainstream client, and thin-and-light notebook for business and consumer clients]

Interrupt Addressing

The number of processors is increasing, straining interrupt addressability limits

xAPIC has been meeting interrupt architecture needs for Intel Architecture since the P4
8-bit APIC IDs: physical addressing limited to 0-254 APIC IDs; logical addressing limited to 60 processors; flat addressing limited to 8 processors
Nehalem processors require 16 APIC IDs per socket for 8 processors
Interrupt addressing is strained on Nehalem platforms: xAPIC can address at most an 8-socket Nehalem system – only 128 processors
The next-generation interrupt architecture allows scaling to 256 processors on Nehalem platforms

x2APIC

x2APIC is Intel's next-generation interrupt architecture, designed to meet interrupt address scaling needs for the future
32-bit APIC IDs enable systems with a very large number of processors (4G)
Compatible with existing PCI-E/PCI devices and IOxAPICs
Requires interrupt remapping support in platform HW and BIOS (Intel® Virtualization Technology for Directed I/O)
ACPI 4.0 is augmented to enumerate x2APIC
OS/VMMs need support for x2APIC and interrupt remapping; minimal SW impact otherwise (see the detection sketch below)
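As a small illustration, x2APIC capability is reported by CPUID leaf 1, ECX bit 21. A hedged user-mode check using the MSVC __cpuid intrinsic follows; note this only reports hardware capability, while actually enabling x2APIC mode is an OS/firmware matter:

// Hedged sketch: detect x2APIC capability from user mode with CPUID.
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4];                           // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                      // leaf 1: feature information
    int hasX2Apic = (regs[2] >> 21) & 1;   // ECX bit 21 = x2APIC capable
    printf("x2APIC capable: %s\n", hasX2Apic ? "yes" : "no");
    return 0;
}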

Scale Up Futures

Server and client
Virtualization and Hypervisor continue to scale
Continued power management innovations
WHEA for client?
ManyCore – 100s of cores

Scale Up and Visual Studio 2010 – Resource Management and Scheduling

Support for parallel programming paradigms in Visual Studio 2010
Resource management and scheduling for concurrent workloads (Microsoft Concurrency Runtime)
Programming models, libraries, and tools for managed and native developers
More information: http://msdn.microsoft.com/concurrency

Enabling Concurrency Runtimes

Reducing kernel intervention in thread scheduling using User Mode Scheduling (UMS)

[Diagram: six threads multiplexed onto Core 1 and Core 2. Under UMS each thread is split into a paired user thread and kernel thread; the user portions are switched in user mode, and non-running threads wait without kernel involvement]

Call To Action

Leverage the topology APIs to build scalable applications and drivers
Enlighten applications and tools to monitor system performance on >64LP machines
Implement support for x2APIC on >128 logical processor Nehalem-EX platforms
Reduce support and maintenance costs by scaling up on Windows Server 2008 R2
Questions? Email GT64P@microsoft.com

Resources

Whitepaper on how to develop applications and drivers for >64P support <coming soon>: http://go.microsoft.com/fwlink/?LinkId=131250
Windows Server Performance Team blog: http://blogs.technet.com/winserverperformance/
Concurrency Runtime programming models, libraries, and tools: http://msdn.microsoft.com/concurrency
Intel® 64 Architecture x2APIC Specification: http://download.intel.com/design/processor/specupdt/318148.pdf
Intel® Virtualization Technology for Directed I/O: http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

Evals & Recordings

Please fill out your evaluation for this session at:
This session will be available as a recording at: www.microsoftpdc.com

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.