Date post: 24-Dec-2015
Upload: adrian-cunningham
Developing Applications For More Than 64 Logical Processors In Windows Server 2008 R2
Transcript
Page 1: One physical processor – may consist of one or more cores One processing unit – may consist of one or more logical processors One logical computing.

Developing Applications For More Than 64 Logical Processors In Windows Server 2008 R2

Page 2: Agenda

Terminology
Announcing Support for Greater than 64 Logical Processors
Hardware Evolution
Application Guidelines
Application Compatibility
New Structures and APIs for >64LP
Non-Uniform Memory Architecture (NUMA)
Optimizing for Topology
System Tools
Intel Enterprise Roadmap
The Future of Scale Up

Page 3: The Future Of Server

Processors can’t continue to get faster and hotter, but Moore’s law still rules
The big iron of today is the commodity server of tomorrow
How does Windows play?
  Bet on Scale Up?
  Bet on virtualization and consolidation?
  Answer: All of the above

Page 4: Scale Up Terminology

Processor / Package / Socket: One physical processor – may consist of one or more cores
Core: One processing unit – may consist of one or more logical processors
Logical Processor (LP) / Hardware Thread: One logical computing engine in the OS, application and driver view
NUMA: Non-Uniform Memory Architecture
NUMA Node: Group of logical processors, cache and memory that are in proximity to each other. I/O is an optional part of a NUMA node
Group <new>: Logical grouping of up to 64 logical processors

Page 5: Scale Up Terminology

[Diagram of a 128 logical processor system, with labels: Group (up to 64 logical processors), NUMA Node, Socket, Core, Logical Processor]

Page 6: Windows Server 2008 R2

Announcing support for up to 256 logical processors
Targeting environments with large single image line-of-business and database applications: SQL Server, SAP, etc.
Refactoring of hot locks in the Windows kernel
Driver stack enlightenment
New APIs for >64LP and locality
Tools and UI for viewing hundreds of logical processors and locality

Page 7: OLTP Scaling 64LP To 128LP

[Chart comparing OLTP results at 64LP and 128LP, annotated 1.7X]

Page 8: Hardware Evolution

Industry is moving towards segmented processing complexes
CPUs share internal caches between cores
CPU and memory are a “node”
“Close” versus “far” hardware can be determined easily

Page 9: Additional Scale Up Features

Windows Hardware Error Architecture (WHEA)
Server Power Management
  Core Parking
Virtualization and Hypervisor

Page 10: Greater Than 64LP Support

Segmented specification – “groups” of CPUs
CPUs identified in software by Group# : CPU#
Allows backward compatibility with 64-bit affinity
New applications have full CPU range using new APIs
Permits better locality during scheduling than a “flat” specification
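The Group# : CPU# addressing on this slide can be pictured with a small portable sketch. It is only a model, not the Windows API: the names `lp_id`, `lp_from_flat` and `lp_to_flat` are hypothetical, and it assumes every group before the last one is completely full, which a real system need not honor because group membership also follows NUMA topology.

```c
#include <stdint.h>

/* Hypothetical model for illustration; not the Windows API. */
#define LPS_PER_GROUP 64u   /* a group holds up to 64 logical processors */

typedef struct {
    uint16_t group;   /* Group# */
    uint8_t  number;  /* CPU# within the group, 0..63 */
} lp_id;

/* Split a system-wide LP index into Group# : CPU#.
   Assumption: every group before the last one is completely full. */
lp_id lp_from_flat(uint32_t flat_index)
{
    lp_id id;
    id.group  = (uint16_t)(flat_index / LPS_PER_GROUP);
    id.number = (uint8_t)(flat_index % LPS_PER_GROUP);
    return id;
}

/* Inverse mapping under the same assumption. */
uint32_t lp_to_flat(lp_id id)
{
    return (uint32_t)id.group * LPS_PER_GROUP + id.number;
}
```

Under this model, flat index 64 becomes group 1, CPU 0. Because the CPU# never exceeds 63, a 64-bit affinity mask is still enough to describe any set of processors inside one group, which is what keeps older code working.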

Page 11: Engagements

Today – Focus on Server and single LOB applications
  Intel, AMD, HP, Unisys, IBM, NEC, Fujitsu
  SQL Server, SAP
  NICs
  Storage
Tomorrow – Focus expands to client and the larger driver and application ecosystem

Page 12: HP Superdome HW Configuration
IA64 example – 256 Logical Processors

64 dual-core hyper-threaded “Montvale” 9100 1.6 GHz Itanium2 w/ 24 MB LLC
1 TB Memory
4 dual-port 1 Gb NICs
30 HP SmartArray P600 HBAs (x4 3Gb SAS)
60 HP MSA70 Disk Arrays
1440 72GB 2.5” 15Krpm Disks
A 64-core configuration can achieve 200,000 IOPS for 8-64 KB requests

Page 13: Unisys ES7000 HW Config
X64 Example – 128 Logical Processors

32 dual-core hyper-threaded “Tulsa” 7100 3.4 GHz Xeon w/ 16 MB LLC
256 GB Memory
24 Emulex LP10002 dual-port FC HBAs
48 HP MSA1000/1500 Disk Arrays
1540 36GB 3.5” 15Krpm Disks

Page 14: Perf Benchmark Run On 128P System

[Video]

Page 15: Application Compatibility

Legacy applications see only a single group
Applications are spread across groups via round-robin
Applications only need changes if:
  They manage per processor information for the whole system – such as task manager
  They are performance critical and need to scale beyond 64 logical processors
Applications that do use processor affinity masks or numbers will still operate correctly
There is zero app compat impact on systems with fewer than 64 logical processors

Page 16: New Structures

Group relative processor affinity and processor number

    typedef struct _GROUP_AFFINITY {
        KAFFINITY Mask;
        USHORT    Group;
        USHORT    Reserved[3];
    } GROUP_AFFINITY, *PGROUP_AFFINITY;

    typedef struct _PROCESSOR_NUMBER {
        USHORT Group;
        UCHAR  Number;
        UCHAR  Reserved;
    } PROCESSOR_NUMBER, *PPROCESSOR_NUMBER;
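To make the group-relative mask concrete, here is a minimal sketch that fills a structure shaped like GROUP_AFFINITY. It uses portable stand-in types (`KAFFINITY_SKETCH`, `GROUP_AFFINITY_SKETCH`) so it builds without the Windows headers, and the helpers `make_group_affinity` and `affinity_contains` are hypothetical names. It also assumes a 64-bit mask, as on x64 builds.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-ins for the Windows types so the sketch builds anywhere.
   On Windows the real definitions come from the platform headers. */
typedef uint64_t KAFFINITY_SKETCH;   /* assumption: 64-bit mask (x64 build) */
typedef struct {
    KAFFINITY_SKETCH Mask;           /* one bit per LP, relative to Group */
    uint16_t Group;
    uint16_t Reserved[3];
} GROUP_AFFINITY_SKETCH;

/* Build an affinity covering LPs first..last (inclusive) of one group.
   Returns 0 on success, -1 if the range does not fit in a 64-bit mask. */
int make_group_affinity(GROUP_AFFINITY_SKETCH *ga, uint16_t group,
                        unsigned first, unsigned last)
{
    unsigned width;
    uint64_t bits;

    if (ga == NULL || first > last || last > 63)
        return -1;
    memset(ga, 0, sizeof *ga);       /* also zeroes the reserved fields */
    width = last - first + 1;
    /* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
    bits = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
    ga->Mask = bits << first;
    ga->Group = group;
    return 0;
}

/* Return 1 if group-relative LP `number` of `group` is in the affinity. */
int affinity_contains(const GROUP_AFFINITY_SKETCH *ga, uint16_t group,
                      unsigned number)
{
    return ga->Group == group && number < 64 &&
           ((ga->Mask >> number) & 1u) != 0;
}
```

For example, `make_group_affinity(&ga, 1, 0, 3)` yields Mask 0xF in Group 1, meaning the first four logical processors of the second group. On Windows, a structure filled this way is what SetThreadGroupAffinity takes as input.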

Page 17: Thread And Process Affinity On >64LP Systems

By default, processes are assigned in a round-robin fashion across groups
An application can override this default using Process Affinity APIs
A process starts in a single group but can expand to contain threads running on all groups in a machine
A single thread can never be assigned to more than one group at any time
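The default policy on this slide can be sketched as a toy model. This is not how the Windows scheduler is implemented. The types `group_assigner` and `thread_affinity` and both functions are hypothetical, chosen only to show the round-robin cursor and the one-group-per-thread rule.

```c
#include <stdint.h>

/* Hypothetical model of the default policy; not the Windows scheduler. */
typedef struct {
    uint16_t group_count;  /* number of active groups, must be > 0 */
    uint16_t next_group;   /* group the next new process will start in */
} group_assigner;

/* Pick the starting group for a new process and advance the cursor. */
uint16_t assign_process_group(group_assigner *ga)
{
    uint16_t g = ga->next_group;
    ga->next_group = (uint16_t)((g + 1u) % ga->group_count);
    return g;
}

/* A thread carries exactly one group plus a mask inside that group. */
typedef struct {
    uint16_t group;
    uint64_t mask;
} thread_affinity;

/* Moving a thread replaces its group; it never spans two groups. */
void move_thread(thread_affinity *t, uint16_t group, uint64_t mask)
{
    t->group = group;
    t->mask  = mask;
}
```

With two groups, successive processes start in groups 0, 1, 0, and so on. A process grows beyond its starting group only when individual threads are moved, one group per thread.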

Page 18: Affinitized Applications

Applications are constrained to a single group unless they explicitly create threads on other groups
  Most applications benefit by locality of resources
Applications need to use extended APIs to affinitize beyond one processor group
  Performance is the motivation
Applications must understand what work can run independently from others and assign it to other groups
  Beware, on NUMA systems this can hinder, not help, performance
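One way an application might divide independent work across groups is in proportion to each group's logical processor count. The sketch below is an illustrative policy only, not something the deck prescribes, and `split_work_by_group` is a hypothetical helper. It ignores data locality, so the NUMA caveat on this slide still applies.

```c
#include <stddef.h>

/* Split `items` independent work items across groups in proportion to each
   group's LP count. Writes one share per group into `share`.
   Returns 0 on success, -1 if there are no groups or no LPs at all. */
int split_work_by_group(const unsigned *lps_per_group, size_t groups,
                        unsigned items, unsigned *share)
{
    unsigned long long total_lps = 0;
    unsigned assigned = 0;
    size_t g;

    if (groups == 0)
        return -1;
    for (g = 0; g < groups; g++)
        total_lps += lps_per_group[g];
    if (total_lps == 0)
        return -1;

    /* Proportional share, rounded down. */
    for (g = 0; g < groups; g++) {
        share[g] = (unsigned)((unsigned long long)items *
                              lps_per_group[g] / total_lps);
        assigned += share[g];
    }
    /* Hand out what integer division left over, one item at a time,
       skipping groups that have no processors. */
    for (g = 0; assigned < items; g = (g + 1) % groups) {
        if (lps_per_group[g] > 0) {
            share[g]++;
            assigned++;
        }
    }
    return 0;
}
```

The shares always add up to `items`. Each share would then be run by threads bound to that group through the extended affinity APIs.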

Page 19: Some New Functions

GetActiveProcessorGroupCount
  Returns the number of active groups in the system
GetMaximumProcessorGroupCount
  Returns the maximum number of groups that the system supports
GetActiveProcessorCount
  Returns the number of active LPs in a group or in the system
GetMaximumProcessorCount
  Returns the maximum number of LPs that a group or the system can support
GetThreadGroupAffinity
  Returns the current group affinity of the thread (in GROUP_AFFINITY)
SetThreadGroupAffinity
  Sets the affinity of the thread to a set of LPs within a specified group (in GROUP_AFFINITY)
CreateRemoteThreadEx
  Enables an application to change the default thread group affinity and specify an ideal LP for a thread
GetNumaNodeNumberFromHandle
  Returns the node number associated with a handle (e.g. file and socket handles)

Page 20: NUMA Nodes
Non-Uniform Memory Architecture

Apps scaling beyond 64 logical processors should be NUMA aware
  Future server designs will be NUMA
A process or thread can set a preferred NUMA node
  This helps with group assignments as nodes must not cross groups
  Also provides a performance optimization inside a group
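The rule that nodes must not cross groups can be illustrated with a simple greedy packing. This is only a sketch of the constraint, not the algorithm Windows uses to form groups, and `pack_nodes_into_groups` is a hypothetical helper.

```c
#include <stddef.h>

#define MAX_LPS_PER_GROUP 64u

/* Assign each NUMA node to a group, in order, starting a new group whenever
   the next node would push the current group past 64 LPs. A node is never
   split between two groups.
   node_lps[i]   : LP count of node i (each must be 1..64)
   node_group[i] : receives the group chosen for node i
   Returns the number of groups used, or -1 if a node cannot fit in a group. */
int pack_nodes_into_groups(const unsigned *node_lps, size_t nodes,
                           unsigned *node_group)
{
    unsigned group = 0, used = 0;
    size_t i;

    if (nodes == 0)
        return 0;
    for (i = 0; i < nodes; i++) {
        if (node_lps[i] == 0 || node_lps[i] > MAX_LPS_PER_GROUP)
            return -1;
        if (used + node_lps[i] > MAX_LPS_PER_GROUP) {
            group++;        /* current group is full enough; open the next */
            used = 0;
        }
        node_group[i] = group;
        used += node_lps[i];
    }
    return (int)group + 1;
}
```

Eight nodes of 16 logical processors each pack into two groups of four nodes. Nodes of 48 and 32 logical processors end up in separate groups, because together they would exceed 64.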

Page 21: What Is Non-Uniform Memory Access (NUMA) I/O?

A picture is worth a thousand words!

Page 22: Bad Case Disk Write
Unoptimized driver

[Diagram: processors P1 to P4 with Cache1 to Cache4, memory MemA and MemB, disks DiskA and DiskB, and a Node Interconnect. Numbered steps (0) to (7) are annotated with I/O Initiator, ISR, DPC and I/O Buffer Home. Two elements are marked "Locked out for I/O Initiation".]

Page 23: Optimization for Topology
Using NUMA APIs

[Diagram: the same components as the previous diagram (P1 to P4, Cache1 to Cache4, MemA, MemB, DiskA, DiskB, Node Interconnect), now showing only steps (2) and (3), annotated with I/O Initiator, ISR and DPC.]

Page 24: System Topology APIs

Processor topology exposed (user/kernel)
  NUMA nodes
  Sockets
  Cores
  Logical Processors (e.g., hyper-threads)
  Includes details of CPU caches (L0/L1/L2)
Memory topology exposed
Device location exposed (e.g., network and storage controllers)
  Determine associated NUMA node(s) via ACPI Proximity Domains (BIOS support required)

Page 25: System Tools

Performance Monitor
  Processor to node relationship information
  Total system logical processor utilization
Task Manager
  Average CPU utilization per LP
  Average CPU utilization for a given node
Resource Monitor

Page 26: Intel® Enterprise Roadmap

All dates, product features and plans are subject to change without notice.

[Roadmap chart with columns 2008 and Future and one row per product line:
  Intel® Xeon® MP 7000 Sequence (Expandable, MC)
  Intel® Xeon® DP 5000 Sequence (Efficient Performance)
  Intel® Xeon® UP 3000 Sequence (Entry)
  Intel® Xeon® WS 5000 Sequence (Workstation & HPC)
  Intel® Itanium® 9000 Sequence (Mission Critical)
Chart entries: Quad/Dual-core Xeon Processor with Intel® 7300, 5000, 5100, 5400 and 3200 Chipsets; Dunnington; Nehalem Processor with Future EX, EP, EN and WS Chipsets; Itanium Processor with 870 / OEM Chipset and Future MC Chipset; Tukwila; Poulson; Kittson]

Page 27: Nehalem Based System Architecture

2, 4, 8 Cores per socket, two logical processors per core
Expect large Nehalem-EX systems with 128-256 logical processors
Intel® QuickPath Architecture
Integrated Memory Controller
Buffered or Un-buffered Memory
*Optional Integrated Graphics

[Diagrams: Nehalem sockets linked by Intel QuickPath Interconnect to I/O Hubs, with PCI Express*, DMI and ICH]

Page 28: Nehalem Family

[Chart of Nehalem family code names by segment.
  Server and Workstation: Nehalem-EX, Expandable (4S+); Nehalem-EP, Efficient Performance (2S)
  Business and Consumer Clients (High End Desktop, Mainstream Client, Thin and Light Notebook): Havendale, Auburndale, Clarksfield, Lynnfield]

Page 29: Number Of Processors Increasing

Straining Interrupt addressability limits

Page 30: Interrupt Addressing

xAPIC has been meeting interrupt architecture needs since P4, for Intel Architecture
  8 bit APIC ID addresses
  Physical addresses limited to 0 – 254 APIC IDs
  Logical addresses limited to 60 processors
  Flat addresses limited to 8 processors
Nehalem processors require 16 APIC IDs per socket for 8 processors
Interrupt addressing strained on Nehalem platforms
  xAPIC can address only up to 8 socket Nehalem
  Only 128 processors
Next Generation Interrupt Architecture allows scaling to 256 processors on Nehalem platforms

Page 31: x2APIC

x2APIC is Intel’s next generation interrupt architecture, designed to meet future interrupt address scaling needs
32-bit APIC ID enables systems with a very large number of processors (4G)
Compatible with existing PCI-E/PCI Devices and IOxAPICs
Requires interrupt remapping support in Platform HW and BIOS
  Intel® Virtualization Technology for Directed I/O
ACPI 4.0 augmented to enumerate x2APIC
OS/VMMs support for x2APIC and interrupt remapping
Minimal SW impact otherwise

Page 32: Scale Up Futures

Server and client
Virtualization and Hypervisor continue to scale
Continued power management innovations
WHEA for Client?
ManyCore – 100s of cores

Page 33: Scale Up and Visual Studio 2010
Resource Management and Scheduling

Support for parallel programming paradigms in Visual Studio 2010
Resource Management and Scheduling for concurrent workloads (Microsoft Concurrency Runtime)
Programming models, libraries and tools for managed and native developers
More Information: http://msdn.microsoft.com/concurrency

Page 34: Enabling Concurrency Runtimes

Reducing kernel intervention in thread scheduling using User Mode Scheduling (UMS)

[Diagrams: Thread1 to Thread6 placed on Core 1, Core 2 and a "Non-running threads" area; and UserThread 1 to 6 each paired with KernelThread 1 to 6 across Core 1 and Core 2]

Page 35: Call To Action

Leverage the topology APIs to build scalable applications and drivers
Enlighten applications and tools to monitor system performance on >64LP machines
Implement support for x2APIC on >128 logical processor Nehalem-EX platforms
Reduce support and maintenance costs by scaling up on Windows Server 2008 R2
Questions? Email [email protected]

Page 36: Resources

Whitepaper on how to develop applications and drivers for >64P support <coming soon>
  http://go.microsoft.com/fwlink/?LinkId=131250
Windows Server Performance Team blog
  http://blogs.technet.com/winserverperformance/
Concurrency Runtime programming models, libraries and tools
  http://msdn.microsoft.com/concurrency
Intel® 64 Architecture x2APIC Specification
  http://download.intel.com/design/processor/specupdt/318148.pdf
Intel® Virtualization Technology for Directed I/O
  http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

Page 37: Evals & Recordings

Please fill out your evaluation for this session at:
This session will be available as a recording at: www.microsoftpdc.com

Page 38:

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

