A Look Inside Intel®: The Core™ (Nehalem) Microarchitecture
Beeman Strong, Intel® Core™ microarchitecture (Nehalem) Architect, Intel Corporation
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Overview
• Enhanced Processor Core
  – Performance Features
  – Intel® Hyper-Threading Technology
• New Platform
  – New Cache Hierarchy
  – New Platform Architecture
• Performance Acceleration
  – Virtualization
  – New Instructions
• Power Management Overview
  – Minimizing Idle Power Consumption
  – Performance when it counts
Scalable Cores
Common feature set: same core for all segments, common software optimization, 45nm
• Servers/Workstations: Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability
• Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security
• Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security
Intel® Core™ Microarchitecture (Nehalem): optimized cores to meet all market segments
The First Intel® Core™ Microarchitecture (Nehalem) Processor
A Modular Design for Flexibility
[Die diagram: four Cores and a Queue sharing an L3 Cache, with a Memory Controller, two QPI links, and Misc IO]
QPI: Intel® QuickPath Interconnect (Intel® QPI)
Intel® Core™ Microarchitecture Recap
• Wide Dynamic Execution: 4-wide decode/rename/retire
• Advanced Digital Media Boost: 128-bit wide SSE execution units
• Intel HD Boost: new SSE4.1 instructions
• Smart Memory Access: memory disambiguation, hardware prefetching
• Advanced Smart Cache: low-latency, high-bandwidth shared L2 cache
Nehalem builds on the great Core microarchitecture
Designed for Performance
[Block diagram: Instruction Fetch & L1 Cache, Branch Prediction, Instruction Decode & Microcode, Out-of-Order Scheduling & Retirement, Execution Units, Paging, L1 Data Cache, Memory Ordering & Execution, and L2 Cache & Interrupt Servicing, annotated with the Nehalem enhancements: New SSE4.2 Instructions, Deeper Buffers, Faster Virtualization, Simultaneous Multi-Threading, Better Branch Prediction, Improved Lock Support, Improved Loop Streaming, Additional Caching Hierarchy]
Macrofusion
• Introduced in Intel® Core™2 microarchitecture
• A TEST/CMP instruction followed by a conditional branch is treated as a single instruction
  – Decoded/executed/retired as one instruction
• Higher performance and improved power efficiency
  – Improves throughput and reduces execution latency
  – Less processing required to accomplish the same work
• Supports all the cases in Intel Core 2 microarchitecture, PLUS
  – CMP+Jcc macrofusion added for the following branch conditions: JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE
  – Intel® Core™ microarchitecture (Nehalem) supports macrofusion in both 32-bit and 64-bit modes; Intel Core 2 microarchitecture supports it only in 32-bit mode
Increased macrofusion benefit on Intel® Core™ microarchitecture (Nehalem)
Intel® Core™ Microarchitecture (Nehalem) Loop Stream Detector
• The Loop Stream Detector identifies software loops
  – Streams from the Loop Stream Detector instead of the normal path
  – Disables unneeded blocks of logic for power savings
  – Higher performance by removing instruction fetch limitations
• Higher performance: expands the size of the loops detected (vs Core 2)
• Improved power efficiency: disables even more logic (vs Core 2)
[Figure: Branch Prediction, Fetch, and Decode pipeline, with the Nehalem Loop Stream Detector after Decode holding up to 28 micro-ops]
Branch Prediction Improvements
• Focus on improving branch prediction accuracy each CPU generation
  – Higher performance and lower power through more accurate prediction
• Example Intel® Core™ microarchitecture (Nehalem) improvements
  – L2 branch predictor: improves accuracy for applications with large code size (e.g. database applications)
  – Advanced renamed Return Stack Buffer (RSB): removes branch mispredicts on the x86 RET instruction (function returns) in the common case
Greater performance through branch prediction
Execution Unit Overview
Execute 6 operations/cycle:
• 3 memory operations: 1 load, 1 store address, 1 store data
• 3 "computational" operations
The Unified Reservation Station dispatches to six ports:
• Port 0: Integer ALU & Shift; FP Multiply; Divide; SSE Integer ALU, Integer Shuffles
• Port 1: Integer ALU & LEA; FP Add; Complex Integer; SSE Integer Multiply
• Port 2: Load
• Port 3: Store Address
• Port 4: Store Data
• Port 5: Integer ALU & Shift; Branch; FP Shuffle; SSE Integer ALU, Integer Shuffles
Unified Reservation Station
• Schedules operations to execution units
• Single scheduler for all execution units
• Can be used by all integer, all FP, etc.
Increased Parallelism
• Goal: keep the powerful execution engine fed
• Nehalem increases the size of the out-of-order window by 33%
• Must also increase other corresponding structures
[Chart: concurrent µops possible on Dothan¹, Merom¹, and Nehalem¹; scale 0–128]
Increased Resources for Higher Performance
Structure            | Intel® Core™ microarchitecture (formerly Merom) | Intel® Core™ microarchitecture (Nehalem) | Comment
Reservation Station  | 32 | 36 | Dispatches operations to execution units
Load Buffers         | 32 | 48 | Tracks all load operations allocated
Store Buffers        | 20 | 32 | Tracks all store operations allocated
¹Intel® Pentium® M processor (formerly Dothan); Intel® Core™ microarchitecture (formerly Merom); Intel® Core™ microarchitecture (Nehalem)
Enhanced Memory Subsystem
• Responsible for handling of memory operations (loads/stores)
• Key Intel® Core™2 features: memory disambiguation, hardware prefetchers, Advanced Smart Cache
• New Intel® Core™ Microarchitecture (Nehalem) features
  – New TLB hierarchy (new, low-latency 2nd-level unified TLB)
  – Fast 16-byte unaligned accesses
  – Faster synchronization primitives
Intel® Hyper-Threading Technology
• Also known as Simultaneous Multi-Threading (SMT): run 2 threads at the same time per core
• Takes advantage of the 4-wide execution engine
  – Keep it fed with multiple threads
  – Hide the latency of a single thread
• Most power-efficient performance feature
  – Very low die area cost
  – Can provide significant performance benefit depending on application
  – Much more efficient than adding an entire core
• Intel® Core™ microarchitecture (Nehalem) advantages: larger caches, massive memory BW
Simultaneous multi-threading enhances performance and energy efficiency
[Figure: execution unit occupancy over time (processor cycles), without SMT vs with SMT; each box represents a processor execution unit]
SMT Performance Chart
Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org
[Chart: performance gain with SMT enabled vs disabled on Intel® Core™ i7 – gains of 7%, 10%, 13%, 16%, 29%, and 34% across Floating Point (based on SPECfp_rate_base2006* estimate), 3dsMax*, Integer (based on SPECint_rate_base2006* estimate), Cinebench* 10, POV-Ray* 3.7 beta 25, and 3DMark* Vantage* CPU]
Designed For Modularity
Optimal price / performance / energy efficiency for server, desktop and mobile products
[Diagram: modular die with N cores plus an "uncore" containing the L3 cache, an integrated memory controller (IMC) to DRAM, Intel® QPI links, and power & clock logic]
Differentiation in the "Uncore": # of QPI links, # of memory channels, size of cache, # of cores, power management, type of memory, integrated graphics
2008 – 2009 Servers & Desktops
Intel® QPI: Intel® QuickPath Interconnect (Intel® QPI)
Intel® Smart Cache – 3rd Level Cache
• Shared across all cores
• Size depends on # of cores
  – Quad-core: up to 8MB (16-way)
  – Scalability: built to vary size with varied core counts, and to easily increase L3 size in future parts
• Perceived latency depends on the frequency ratio between core & uncore
• Inclusive cache policy for best performance
  – An address residing in L1/L2 must be present in the 3rd-level cache
[Diagram: per-core L1 caches and L2 cache in front of a shared L3 cache]
Why Inclusive?
• An inclusive cache provides the benefit of an on-die snoop filter
• Core valid bits
  – 1 bit per core per cache line
  – If the line may be in a core, set that core's valid bit
  – A snoop is only needed if the line is in the L3 and the core valid bit is set
  – The line is guaranteed not to be modified if multiple bits are set
• Scalability: addition of cores/sockets does not increase the snoop traffic seen by cores
• Latency
  – Minimize effective cache latency by eliminating cross-core snoops in the common case
  – Minimize snoop response time for cross-socket cases
Intel® Core™ Microarchitecture (Nehalem-EP) Platform Architecture
• Integrated memory controller
  – 3 DDR3 channels per socket
  – Massive memory bandwidth
  – Memory bandwidth scales with # of processors
  – Very low memory latency
• Intel® QuickPath Interconnect (Intel® QPI)
  – New point-to-point interconnect
  – Socket-to-socket and socket-to-chipset connections
  – Builds scalable solutions
  – Up to 6.4 GT/sec (12.8 GB/sec); bidirectional (=> 25.6 GB/sec)
Significant performance leap from the new platform
[Diagram: two Nehalem-EP CPUs, each with local memory, linked to each other and to the Tylersburg-EP IOH via Intel QPI]
Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation Server Processor Technology (Tylersburg-EP)
Non-Uniform Memory Access (NUMA)
• FSB architecture: all memory in one location
• Starting with Intel® Core™ microarchitecture (Nehalem), memory is located in multiple places
• Latency to memory depends on location
  – Local memory has the highest BW and lowest latency
  – Remote memory is still very fast
Ensure software is NUMA-optimized for best performance
[Chart: relative memory latency – Harpertown (FSB 1600) vs Nehalem (DDR3-1067) local vs Nehalem (DDR3-1067) remote]
Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation Server Processor Technology (Tylersburg-EP)
Memory Bandwidth – Initial Intel® Core™ Microarchitecture (Nehalem) Products
• 3 memory channels per socket
• ≥ DDR3-1066 at launch
• Massive memory BW
• Scalability
  – IMC and core designed to take advantage of the BW
  – Allows performance to scale with cores
  – Core enhancements: support more cache misses per core; aggressive hardware prefetching with throttling enhancements
  – Example IMC features: independent memory channels; aggressive request reordering
Massive memory BW provides performance and scalability
[Chart: Stream bandwidth (Triad), MB/sec – 6102 for HTN 3.16 / BF1333 / 667 MHz mem; 9776 for HTN 3.00 / SB1600 / 800 MHz mem; 33376 for NHM 2.66 / 6.4 QPI / 1066 MHz mem – a 3.4X gain]
Source: Intel internal measurements – August 2008
HTN: Intel® Xeon® processor 5400 Series (Harpertown); NHM: Intel® Core™ microarchitecture (Nehalem)
Virtualization
• To get the best virtualized performance
  – Have the best native performance
  – Reduce transitions to/from the virtual machine
  – Reduce the latency of transitions
• Intel® Core™ microarchitecture (Nehalem) virtualization features
  – Reduced latency for transitions
  – Virtual Processor ID (VPID) to reduce the effective cost of transitions
  – Extended Page Table (EPT) to reduce the # of transitions
[Chart: relative round-trip virtualization latency falling from Merom to Penryn to Nehalem]
EPT Solution
• Intel® 64 page tables
  – Map guest linear address to guest physical address
  – Can be read and written by the guest OS
• New EPT page tables under VMM control
  – Map guest physical address to host physical address
  – Referenced by a new EPT base pointer
• No VM exits due to page faults, INVLPG, or CR3 accesses
[Diagram: guest linear address → (CR3, Intel® 64 page tables) → guest physical address → (EPT base pointer, EPT page tables) → host physical address]
Extending Performance and Energy Efficiency – Intel® SSE4.2 Instruction Set Architecture (ISA) Leadership in 2008
• SSE4 (45nm CPUs) = SSE4.1 (Penryn core) + SSE4.2 (Nehalem core)
• SSE4.2 adds STTNI and ATA (Application Targeted Accelerators)
• STTNI: accelerated string and text processing (e.g. XML acceleration)
  – Faster XML parsing; faster search and pattern matching; novel parallel data matching and comparison operations
• POPCNT: accelerated searching & pattern recognition of large data sets (e.g. genome mining)
  – Improved performance for genome mining and handwriting recognition; fast Hamming distance / population count
• CRC32: new communications capabilities (e.g. iSCSI application)
  – Hardware-based CRC instruction; accelerated network-attached storage; improved power efficiency for software iSCSI, RDMA, and SCTP
What should application, OS, and VMM vendors do?
• Understand the benefits & take advantage of the new instructions in 2008
• Provide us feedback on instructions ISVs would like to see for the next generation of applications
Intel® Core™ Microarchitecture (Nehalem) Design Goals
World-class performance combined with superior energy efficiency – optimized for:
• Existing apps, emerging apps, all usages
• Single thread and multi-threads
• Workstation / server and desktop / mobile
A single, scalable foundation optimized across each segment and power envelope
Dynamically scaled performance when needed to maximize energy efficiency
A Dynamic and Design Scalable Microarchitecture
Power Control Unit
[Diagram: per-core Vcc, frequency, and sensor inputs feeding the PCU, with a PLL per core and for the uncore/LLC; BCLK and Vcc supplied externally]
• Integrated proprietary microcontroller
• Shifts control from hardware to embedded firmware
• Real-time sensors for temperature, current, power
• Flexibility enables sophisticated algorithms, tuned for current operating conditions
Minimizing Idle Power Consumption
• The operating system notifies the CPU when no tasks are ready for execution, by executing the MWAIT instruction
• MWAIT arguments hint at the expected idle duration
• CPU idle states are referred to as "C-states"
  – Higher-numbered C-states have lower power, but also longer exit latency
[Chart: idle power (W) vs exit latency (µs), decreasing from C0 through C1 to Cn]
C6 on Intel® Core™ Microarchitecture (Nehalem)
[Animation: per-core power over time for Cores 0–3]
1. Cores 0, 1, 2, and 3 running applications.
2. A task on Core 2 completes. No work waiting. The OS executes the MWAIT(C6) instruction.
3. Execution stops. Core architectural state is saved. Core clocks are stopped. Cores 0, 1, and 3 continue execution undisturbed.
4. Core 2's power gate is turned off. Core voltage goes to 0. Cores 0, 1, and 3 continue execution undisturbed.
5. A task on Core 0 completes. No work waiting. The OS executes the MWAIT(C6) instruction. Core 0 enters C6. Cores 1 and 3 continue execution undisturbed.
6. An interrupt for Core 2 arrives. Core 2 returns to C0; execution resumes at the instruction following MWAIT(C6). Cores 1 and 3 continue execution undisturbed.
7. An interrupt for Core 0 arrives. The power gate turns on, the core clock turns on, core state is restored, and the core resumes execution at the instruction following MWAIT(C6). Cores 1, 2, and 3 continue execution undisturbed.
Core-independent C6 on Intel Core microarchitecture (Nehalem) extends the benefits
Intel® Core™ Microarchitecture (Nehalem)-based Processor
• Significant logic outside the core
  – Integrated memory controller
  – Large shared cache
  – High-speed interconnect
  – Arbitration logic
• Total CPU power consumption includes: core leakage, core clock distribution, and core clocks and logic (× N cores), plus uncore leakage, uncore clock distribution, uncore logic, and I/O
[Die diagram: four Cores and a Queue sharing an L3 Cache, with a Memory Controller, two QPI links, and Misc IO]
QPI = Intel® QuickPath Interconnect (Intel® QPI)
Intel® Core™ Microarchitecture (Nehalem) Package C-State Support
[Animation: components of active CPU power eliminated step by step]
• All cores in C6 state: core power goes to ~0 (core clocks, logic, clock distribution, and leakage eliminated)
• Package enters C6 state:
  – Uncore logic stops toggling
  – I/O goes to a lower power state
  – Uncore clock grids are stopped
Substantial reduction in idle CPU power
Managing Active Power
• The operating system changes frequency as needed to meet performance needs and minimize power
  – Enhanced Intel SpeedStep® Technology
  – Referred to as processor P-states
• The PCU tunes voltage for the given frequency, operating conditions, and silicon characteristics
The PCU automatically optimizes operating voltage
Turbo Mode: Key to Scalability Goal
• Intel® Core™ microarchitecture (Nehalem) is a scalable architecture
  – High-frequency core for performance in less constrained form factors
  – Retains the ability to use that frequency in very small form factors
  – Retains the ability to use that frequency when running lightly threaded or lower-power workloads
• Turbo utilizes available frequency headroom
  – Maximizes both single-thread and multi-thread performance in the same part
Turbo Mode provides performance when you need it
Turbo Mode Enabling
• Turbo Mode is exposed as an additional Enhanced Intel SpeedStep® Technology operating point
  – The operating system treats it as any other P-state, requesting Turbo Mode when it needs more performance
  – The performance benefit comes from higher operating frequency; there is no need to enable or tune software
• Turbo Mode is transparent to the system
  – Frequency transitions are handled completely in hardware
  – The PCU keeps the silicon within existing operating limits
  – Systems are designed to the same specs, with or without Turbo Mode
Performance benefits with existing applications and operating systems
Summary
• Intel® Core™ microarchitecture (Nehalem) – the 45nm "Tock"
• Designed for power efficiency, scalability, and performance
• Key innovations:
  – Enhanced processor core
  – Brand-new platform architecture
  – Sophisticated power management
High performance when you need it, lower power when you don't
Q&A
Legal Disclaimer• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.• All products, dates, and figures specified are preliminary based on current expectations, and
are subject to change without notice.• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as
errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Merom, Penryn, Harpertown, Nehalem, Dothan, Westmere, Sandy Bridge, and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, Intel Core, Pentium, Intel SpeedStep Technology, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.• Copyright © 2008 Intel Corporation.
Risk FactorsThis presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today’s date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient supply of components from suppliers to meet demand. 
The gross margin percentage could vary significantly from expectations based on changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; excess or obsolete inventory; manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing, assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the report on Form 10-Q for the quarter ended June 28, 2008.
Backup Slides
Tick-Tock Development Model
• Merom¹: NEW Microarchitecture, 65nm (TOCK)
• Penryn: NEW Process, 45nm (TICK)
• Nehalem: NEW Microarchitecture, 45nm (TOCK)
• Westmere: NEW Process, 32nm (TICK) (forecast)
• Sandy Bridge: NEW Microarchitecture, 32nm (TOCK) (forecast)
All dates, product descriptions, availability and plans are forecasts and subject to change without notice.
¹Intel® Core™ microarchitecture (formerly Merom); 45nm next-generation Intel® Core™ microarchitecture (Penryn); Intel® Core™ Microarchitecture (Nehalem); Intel® Microarchitecture (Westmere); Intel® Microarchitecture (Sandy Bridge)
Enhanced Processor Core
[Block diagram: Front End (ITLB, 32kB Instruction Cache, Instruction Fetch and Pre-Decode, Instruction Queue, 4-wide Decode), Execution Engine (Rename/Allocate, Retirement Unit (ReOrder Buffer), Reservation Station, 6 Execution Units), and Memory (DTLB, 2nd-Level TLB, 32kB Data Cache, 256kB 2nd-Level Cache, L3 and beyond)]
Front-End
• Responsible for feeding the compute engine
  – Decode instructions
  – Branch prediction
• Key Intel® Core™2 microarchitecture features
  – 4-wide decode
  – Macrofusion
  – Loop Stream Detector
[Diagram: ITLB and 32kB Instruction Cache feeding Instruction Fetch and Pre-Decode, the Instruction Queue, and Decode]
Loop Stream Detector Reminder
• Loops are very common in most software
• Take advantage of knowledge of loops in HW
  – Decoding the same instructions over and over
  – Making the same branch predictions over and over
• The Loop Stream Detector identifies software loops
  – Streams from the Loop Stream Detector instead of the normal path
  – Disables unneeded blocks of logic for power savings
  – Higher performance by removing instruction fetch limitations
[Figure: Intel® Core™2 Loop Stream Detector in the Branch Prediction / Fetch / Decode pipeline, holding up to 18 instructions]
Branch Prediction Reminder
• Goal: keep the powerful compute engine fed
• Options:
  – Stall the pipeline while determining branch direction/target
  – Predict branch direction/target and correct if wrong
• Minimize the time wasted correcting incorrect branch predictions
  – Performance: through higher branch prediction accuracy, and through faster correction when a prediction is wrong
  – Power efficiency: minimize the number of speculative/incorrect micro-ops that are executed
Continued focus on branch prediction improvements
L2 Branch Predictor
• Problem: software with a large code footprint does not fit well in existing branch predictors
  – Example: database applications
• Solution: use a multi-level branch prediction scheme
• Benefits:
  – Higher performance through improved branch prediction accuracy
  – Greater power efficiency through less mis-speculation
Advanced Renamed Return Stack Buffer (RSB)
• Instruction reminder
  – CALL: entry into functions
  – RET: return from functions
• Classical solution
  – A Return Stack Buffer (RSB) is used to predict RET targets
  – The RSB can be corrupted by speculative paths
• The renamed RSB
  – No RET mispredicts in the common case
60
Execution Engine
• Responsible for:
  – Scheduling operations
  – Executing operations
• Powerful Intel® Core™2 microarchitecture execution engine
  – Dynamic 4-wide execution
  – Intel® Advanced Digital Media Boost (128-bit wide SSE)
  – Super Shuffler (45nm next generation Intel® Core™ microarchitecture (Penryn))
61
Intel® Smart Cache – Core Caches
• New 3-level cache hierarchy
• 1st level caches
  – 32kB Instruction Cache
  – 32kB, 8-way Data Cache
  – Support more L1 misses in parallel than Intel® Core™2 microarchitecture
• 2nd level cache
  – New cache introduced in Intel® Core™ microarchitecture (Nehalem)
  – Unified (holds code and data)
  – 256kB per core (8-way)
  – Performance: very low latency (10 cycle load-to-use)
  – Scalability: as core count increases, reduces pressure on the shared cache
[Diagram: core with 32kB L1 Instruction Cache, 32kB L1 Data Cache, and 256kB L2 Cache]
62
New TLB Hierarchy
• Problem: applications continue to grow in data size
• TLB sizes must grow to keep pace for performance
• Nehalem adds a new low-latency unified 2nd level TLB

TLB                                # of Entries
1st Level Instruction TLBs
    Small Page (4k)                128
    Large Page (2M/4M)             7 per thread
1st Level Data TLBs
    Small Page (4k)                64
    Large Page (2M/4M)             32
New 2nd Level Unified TLB
    Small Page Only                512
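The table implies how much address space can be mapped without a page walk. A quick sketch of that "TLB reach" arithmetic (the helper name is ours): 512 small-page entries at 4kB each cover 2MB by themselves, before counting the 1st level TLBs.

```c
#include <assert.h>

/* Back-of-envelope TLB reach: entries times page size. Illustrative
 * helper only; actual behavior also involves the 1st level TLBs and
 * large pages. */
unsigned long tlb_reach_bytes(unsigned long entries, unsigned long page_bytes)
{
    return entries * page_bytes;
}
```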
63
Fast Unaligned Cache Accesses
• Two flavors of 16-byte SSE loads/stores exist
  – Aligned (MOVAPS/D, MOVDQA): must be aligned on a 16-byte boundary
  – Unaligned (MOVUPS/D, MOVDQU): no alignment requirement
• Prior to Intel® Core™ microarchitecture (Nehalem)
  – Optimized for aligned instructions
  – Unaligned instructions slower, lower throughput (even for aligned accesses!)
  – Required multiple uops (not energy efficient)
  – Compilers would largely avoid unaligned loads; a 2-instruction sequence (MOVSD+MOVHPD) was faster
• Intel Core microarchitecture (Nehalem) optimizes unaligned instructions
  – Same speed/throughput as aligned instructions on aligned accesses
  – Optimizations for making accesses that cross 64-byte boundaries fast (lower latency/higher throughput than Core 2)
  – Aligned instructions remain fast
• No reason to use aligned instructions on Intel Core microarchitecture (Nehalem)!
• Benefits:
  – Compiler can now use unaligned instructions without fear
  – Higher performance on key media algorithms
  – More energy efficient than prior implementations
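A minimal sketch of the compiler consequence described above, using the SSE2 unaligned load/store intrinsics (which compile to MOVDQU). The helper name is ours; on Nehalem this form runs at full speed even when the pointers happen to be 16-byte aligned, so it can be emitted unconditionally.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 (MOVDQU) */
#include <stdint.h>
#include <assert.h>

/* Copy 16 bytes with no alignment requirement on either pointer,
 * instead of a MOVSD+MOVHPD style split sequence. */
void copy16_unaligned(const uint8_t *src, uint8_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);  /* MOVDQU load  */
    _mm_storeu_si128((__m128i *)dst, v);                /* MOVDQU store */
}
```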
64
Faster Synchronization Primitives
• Multi-threaded software becoming more prevalent
• Scalability of multi-thread applications can be limited by synchronization
• Synchronization primitives: LOCK prefix, XCHG
• Reduce synchronization latency for legacy software
Greater thread scalability with Nehalem
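The primitive being measured can be sketched with C11 atomics: a bare-bones spinlock whose acquire loop compilers lower to a LOCK CMPXCHG on x86. This is an illustrative sketch only (no PAUSE, backoff, or fairness), not production lock code.

```c
#include <stdatomic.h>
#include <assert.h>

/* Minimal test-and-set style spinlock built on atomic compare-exchange;
 * each failed/successful attempt is one LOCK CMPXCHG, the operation
 * whose latency the slide compares across generations. */
typedef struct { atomic_int locked; } spinlock;

static void spin_lock(spinlock *l)
{
    int expected = 0;
    /* Loop until we atomically swing 0 -> 1. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1))
        expected = 0;   /* failed CAS overwrote expected; reset it */
}

static void spin_unlock(spinlock *l)
{
    atomic_store(&l->locked, 0);
}
```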
[Chart: relative LOCK CMPXCHG latency: Intel® Pentium® 4 processor = 1.0, with the Intel® Core™2 Duo processor lower and the Intel® Core™ microarchitecture (Nehalem)-based processor lowest]
65
Intel® Core™ Microarchitecture (Nehalem) SMT Implementation Details
SMT efficient due to minimal replication of logic

Policy                 Description                            Intel® Core™ Microarchitecture (Nehalem) Examples
Replicated             Duplicate logic per thread             Register State, Renamed RSB, Large Page ITLB
Partitioned            Statically allocated between threads   Load Buffer, Store Buffer, Reorder Buffer, Small Page ITLB
Competitively Shared   Depends on thread's dynamic behavior   Reservation Station, Caches, Data TLB, 2nd level TLB
Unaware                No SMT impact                          Execution units
66
Feeding the Execution Engine
• Powerful 4-wide dynamic execution engine
• Need to keep providing fuel to the execution engine
• Intel® Core™ Microarchitecture (Nehalem) goals
  – Low latency to retrieve data: keep the execution engine fed w/o stalling
  – High data bandwidth: handle requests from multiple cores/threads seamlessly
  – Scalability: design for increasing core counts
• Combination of great cache hierarchy and new platform

Intel® Core™ microarchitecture (Nehalem) designed to feed the execution engine
67
Inclusive vs. Exclusive Caches – Cache Miss
Exclusive vs. Inclusive
[Diagram: four cores (Core 0 through Core 3) above a shared L3 cache, shown for both the Exclusive and Inclusive hierarchies]
Data request from Core 0 misses Core 0’s L1 and L2
Request sent to the L3 cache
68
Inclusive vs. Exclusive Caches – Cache Miss
Exclusive vs. Inclusive
[Diagram: four cores above a shared L3 cache in each hierarchy]
Core 0 looks up the L3 cache. Data not in the L3 cache: MISS! in both hierarchies
69
Inclusive vs. Exclusive Caches – Cache Miss
Exclusive vs. Inclusive
[Diagram: four cores above a shared L3 cache in each hierarchy; both miss in L3]
Exclusive: must check the other cores' caches
Inclusive: guaranteed the data is not on-die
Greater scalability from inclusive approach
70
Inclusive vs. Exclusive Caches – Cache Hit
Exclusive vs. Inclusive
[Diagram: four cores above a shared L3 cache in each hierarchy; both hit in L3]
Exclusive: no need to check other cores
Inclusive: data could be in another core's cache, BUT Intel® Core™ microarchitecture (Nehalem) is smart…
71
Inclusive vs. Exclusive Caches – Cache Hit
Inclusive
[Diagram: four cores above a shared L3 cache; the hit line's core valid bits read 0 0 0 0]
Core valid bits limit unnecessary snoops:
• Maintain a set of "core valid" bits per cache line in the L3 cache
• Each bit represents a core
• If the L1/L2 of a core may contain the cache line, then that core's valid bit is set to "1"
• No snoops of cores are needed if no bits are set
• If more than 1 bit is set, the line cannot be in Modified state in any core
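The snoop-filtering rule in the bullets above can be sketched as a tiny helper (hypothetical, not actual hardware logic): given a line's core-valid mask, only set bits other than the requester's need snooping, and an empty mask means no snoops at all.

```c
#include <assert.h>

/* Which cores must be snooped for a line whose L3 entry carries
 * core_valid_bits (bit i set means core i may hold the line)?
 * The requesting core never snoops itself. */
unsigned cores_to_snoop(unsigned core_valid_bits, int requester)
{
    return core_valid_bits & ~(1u << requester);
}
```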
72
Inclusive vs. Exclusive Caches – Read from other core
Exclusive vs. Inclusive
[Diagram: four cores above a shared L3 cache in each hierarchy; core valid bits 0 0 1 0 on the inclusive side]
Exclusive: L3 miss, must check all other cores
Inclusive: L3 hit, only need to check the core whose core valid bit is set
73
Local Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
• Step 1:
  – CPU0 requests data from its DRAM
  – CPU0 snoops CPU1 to check if data is present
• Step 2:
  – DRAM returns data
  – CPU1 returns snoop response
• Local memory latency is the maximum latency of the two responses
• Intel® Core™ microarchitecture (Nehalem) optimized to keep key latencies close to each other
[Diagram: CPU0 and CPU1, each with local DRAM, connected by Intel® QPI]
Intel® QPI = Intel® QuickPath Interconnect
74
Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
  – CPU0 requests data from CPU1
  – Request sent over Intel® QuickPath Interconnect (Intel® QPI) to CPU1
  – CPU1's IMC makes a request to its DRAM
  – CPU1 snoops internal caches
  – Data returned to CPU0 over Intel QPI
• Remote memory latency is a function of having a low-latency interconnect
[Diagram: CPU0 and CPU1, each with local DRAM, connected by Intel® QPI]
75
Hardware Prefetching (HWP)
• HW prefetching critical to hiding memory latency
• Structure of HWPs similar to Intel® Core™2 microarchitecture
  – Algorithmic improvements in Intel® Core™ microarchitecture (Nehalem) for higher performance
• L1 prefetchers
  – Based on instruction history and/or load address pattern
• L2 prefetchers
  – Prefetch loads/RFOs/code fetches based on address pattern
  – Intel Core microarchitecture (Nehalem) changes:
    – More efficient prefetch mechanism; removes the need for Intel® Xeon® processors to disable HWP
    – Increased prefetcher aggressiveness: locks on to address streams quicker, adapts to change faster, issues prefetches more aggressively (when appropriate)
76
Today’s Platform Architecture
Front-Side Bus Evolution
[Diagrams: three generations of front-side bus platforms, each with four CPUs connected through the FSB to the MCH (memory attached) and the ICH below]
77
Intel® QuickPath Interconnect
• Intel® Core™ microarchitecture (Nehalem) introduces the new Intel® QuickPath Interconnect (Intel® QPI)
• High-bandwidth, low-latency point-to-point interconnect
• Up to 6.4 GT/sec initially
  – 6.4 GT/sec -> 12.8 GB/sec
  – Bi-directional link -> 25.6 GB/sec per link
  – Future implementations at even higher speeds
• Highly scalable for systems with varying # of sockets
[Diagrams: two Nehalem-EP sockets, each with local memory, linked to each other and to an IOH over Intel® QPI; a four-socket CPU configuration with two IOHs]
Intel® Core™ microarchitecture (Nehalem-EP)
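The bandwidth bullets follow from QPI carrying a 2-byte payload per direction per transfer. A sketch of that arithmetic (helper name is ours):

```c
#include <assert.h>

/* Intel QPI bandwidth arithmetic from the slide: 2 bytes per direction
 * per transfer, so 6.4 GT/s -> 12.8 GB/s one way and 25.6 GB/s for a
 * bidirectional link. */
double qpi_gb_per_s(double gt_per_s, int bidirectional)
{
    double one_way = gt_per_s * 2.0;  /* 16-bit payload = 2 bytes */
    return bidirectional ? 2.0 * one_way : one_way;
}
```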
78
Integrated Memory Controller (IMC)
• Memory controller optimized per market segment
• Initial Intel® Core™ microarchitecture (Nehalem) products
  – Native DDR3 IMC
  – Up to 3 channels per socket
  – Massive memory bandwidth
  – Designed for low latency
  – Support RDIMM and UDIMM
  – RAS features
• Future products
  – Scalability
    – Vary # of memory channels
    – Increase memory speeds
    – Buffered and non-buffered solutions
  – Market-specific needs
    – Higher memory capacity
    – Integrated graphics
[Diagram: two Nehalem-EP sockets with local DDR3, connected to Tylersburg-EP]
Significant performance through new IMC
Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation Server Processor Technology (Tylersburg-EP)
79
Memory Latency Comparison
• Low memory latency critical to high performance
• Integrated memory controller designed for low latency
• Need to optimize both local and remote memory latency
• Intel® Core™ microarchitecture (Nehalem) delivers
  – Huge reduction in local memory latency
  – Even remote memory latency is fast
• Effective memory latency varies per application/OS
  – Percentage of local vs. remote accesses
  – Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
[Chart: relative memory latency: next generation Quad-Core Intel® Xeon® processor (Harpertown, FSB 1600) = 1.0; Nehalem (DDR3-1067) local and remote latencies both substantially lower]
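The "regardless of mix" point can be made concrete with a weighted-average model of effective latency. The latencies used in the test are purely illustrative placeholders, not measured Nehalem figures.

```c
#include <assert.h>

/* Effective NUMA latency: access-weighted blend of local and remote
 * latency. Illustrative model only; real behavior also depends on
 * queueing and bandwidth. */
double effective_latency_ns(double frac_local, double local_ns, double remote_ns)
{
    return frac_local * local_ns + (1.0 - frac_local) * remote_ns;
}
```

For example, with 75% local accesses and hypothetical 60 ns local / 100 ns remote latencies, the effective latency is 70 ns; lowering either component lowers the blend for every mix.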
80
Latency of Virtualization Transitions
• Microarchitectural
  – Huge latency reduction generation over generation
  – Nehalem continues the trend
• Architectural
  – Virtual Processor ID (VPID) added in Intel® Core™ microarchitecture (Nehalem)
  – Removes the need to flush TLBs on transitions

Higher virtualization performance through lower transition latencies

[Chart: relative round-trip virtualization latency: Intel® Core™ microarchitecture (formerly Merom) = 100%, with 45nm next generation Intel® Core™ microarchitecture (Penryn) lower and Intel® Core™ microarchitecture (Nehalem) lowest]
81
Extended Page Tables (EPT) Motivation
[Diagram: in VM1, the guest OS maintains a guest page table referenced by its CR3; the VMM maintains the active page table actually used by the CPU; guest page table changes cause exits into the VMM]
• A VMM needs to protect physical memory
  – Multiple guest OSs share the same physical memory
  – Protections are implemented through page-table virtualization
• Page-table virtualization accounts for a significant portion of virtualization overheads
  – VM exits / entries
• The goal of EPT is to reduce these overheads
82
STTNI - STring & Text New Instructions
Operates on strings of bytes or words (16-bit)

Equal Each Instruction: true for each character in Src2 if the same position in Src1 is equal
    Src1: Test\tday    Src2: tad tseT    Mask: 01101111

Equal Any Instruction: true for each character in Src2 if any character in Src1 matches
    Src1: Example\n    Src2: atad tsT    Mask: 10100000

Ranges Instruction: true if a character in Src2 is in at least one of up to 8 ranges in Src1
    Src1: AZ'0'9zzz    Src2: taD tseT    Mask: 00100001

Equal Ordered Instruction: finds the start of a substring (Src1) within another string (Src2)
    Src1: ABCA0XYZ    Src2: S0BACBAB    Mask: 00000010

Projected 3.8x kernel speedup on XML parsing & 2.7x savings on instruction cycles

[Diagram: STTNI model: each character of Src1 (XMM) is compared against each character of Src2 (XMM/M128); for Equal Each, each bit along the diagonal is checked to produce the IntRes1 mask]
83
STTNI Model
[Diagrams: Src1 (XMM) by Src2 (XMM/M128) comparison matrices for the four aggregation modes:
EQUAL EACH: check each bit along the diagonal to produce IntRes1;
EQUAL ANY: OR results down each column;
RANGES: first compare does GE, next does LE; AND GE/LE pairs of results, then OR those results;
EQUAL ORDERED: AND the results along each diagonal]
84
ATA - Application Targeted Accelerators
CRC32
• Accumulates a CRC32 value using the iSCSI polynomial
• One register maintains the running CRC value as a software loop iterates over data; fixed CRC polynomial = 11EDC6F41h
• Replaces complex instruction sequences for CRC in upper-layer data protocols: iSCSI, RDMA, SCTP
• Enables enterprise-class data assurance with high data rates in networked storage in any user environment
[Diagram: CRC32 combines 8/16/32/64-bit SRC data with the old CRC in DST bits 31:0 to produce the new CRC]

POPCNT
• POPCNT determines the number of nonzero bits in the source
• ZFlag set if result is zero; all other flags (C,S,O,A,P) reset
• Useful for speeding up fast matching in data mining workloads including: DNA/genome matching, voice recognition
[Diagram: POPCNT of RAX (bits 0 1 0 . . . 0 0 1 1) writes 0x3 to RBX]
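What a single POPCNT replaces can be seen in a classic software bit count (Kernighan's method), which needs one loop iteration per set bit:

```c
#include <stdint.h>
#include <assert.h>

/* Software population count: each iteration clears the lowest set bit,
 * so the loop runs once per nonzero bit. POPCNT does this in one
 * instruction. */
int popcount64_sw(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear lowest set bit */
        n++;
    }
    return n;
}
```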
85
Tools Support of New Instructions
• Intel® Compiler 10.x supports the new instructions
  – Nehalem-specific compiler optimizations
  – SSE4.2 supported via vectorization and intrinsics
  – Inline assembly supported on both IA-32 and Intel® 64 architecture targets
  – Necessary to include required header files in order to access intrinsics
• Intel® XML Software Suite
  – High-performance C++ and Java runtime libraries
  – Version 1.0 (C++), version 1.01 (Java) available now
  – Version 1.1 w/SSE4.2 optimizations planned for September 2008
• Microsoft Visual Studio* 2008 VC++
  – SSE4.2 supported via intrinsics
  – Inline assembly supported on IA-32 only
  – Necessary to include required header files in order to access intrinsics
  – VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions
• Sun Studio Express* 7/08
  – Supports Intel® Core™ microarchitecture (Merom), 45nm next generation Intel® Core™ microarchitecture (Penryn), and Intel® Core™ microarchitecture (Nehalem)
  – SSE4.1, SSE4.2 through intrinsics
  – Nehalem-specific compiler optimizations
• GCC* 4.3.1
  – Supports Intel Core microarchitecture (Merom), 45nm next generation Intel Core microarchitecture (Penryn), and Intel Core microarchitecture (Nehalem) via -mtune=generic
  – Supports SSE4.1 and SSE4.2 through vectorizer and intrinsics

Broad Software Support for Intel® Core™ Microarchitecture (Nehalem)
86
Software Optimization Guidelines
• Most optimizations for Intel® Core™ microarchitecture still hold
• Examples of new optimization guidelines:
  – 16-byte unaligned loads/stores
  – Enhanced macrofusion rules
  – NUMA optimizations
• Intel® Core™ microarchitecture (Nehalem) SW Optimization Guide will be published
• Intel® Compiler will support settings for Intel Core microarchitecture (Nehalem) optimizations
87
Example Code For strlen()
STTNI version:

int sttni_strlen(const char * src)
{
    char eom_vals[32] = {1, 255, 0};
    __asm{
        mov eax, src
        movdqu xmm2, eom_vals
        xor ecx, ecx
    topofloop:
        add eax, ecx
        movdqu xmm1, OWORD PTR[eax]
        pcmpistri xmm2, xmm1, imm8
        jnz topofloop
    endofstring:
        add eax, ecx
        sub eax, src
        ret
    }
}

Current code:

string  equ [esp + 4]
        mov     ecx,string          ; ecx -> string
        test    ecx,3               ; test if string is aligned on 32 bits
        je      short main_loop
str_misaligned:                     ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        add     ecx,1
        test    al,al
        je      short byte_3
        test    ecx,3
        jne     short str_misaligned
        add     eax,dword ptr 0     ; 5 byte nop to align label below
        align   16                  ; should be redundant
main_loop:
        mov     eax,dword ptr [ecx] ; read 4 bytes
        mov     edx,7efefeffh
        add     edx,eax
        xor     eax,-1
        xor     eax,edx
        add     ecx,4
        test    eax,81010100h
        je      short main_loop
        ; found zero byte in the loop
        mov     eax,[ecx - 4]
        test    al,al               ; is it byte 0
        je      short byte_0
        test    ah,ah               ; is it byte 1
        je      short byte_1
        test    eax,00ff0000h       ; is it byte 2
        je      short byte_2
        test    eax,0ff000000h      ; is it byte 3
        je      short byte_3
        jmp     short main_loop     ; taken if bits 24-30 are clear and bit 31 is set
byte_3:
        lea     eax,[ecx - 1]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_2:
        lea     eax,[ecx - 2]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_1:
        lea     eax,[ecx - 3]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_0:
        lea     eax,[ecx - 4]
        mov     ecx,string
        sub     eax,ecx
        ret
strlen  endp
        end

Current code: minimum of 11 instructions; inner loop processes 4 bytes with 8 instructions
STTNI code: minimum of 10 instructions; a single inner loop processes 16 bytes with only 4 instructions
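For readers who prefer intrinsics to MSVC inline assembly, here is a sketch of the same idea using _mm_cmpistri (the PCMPISTRI intrinsic). Comparing an all-zero Src1 against each 16-byte chunk in "equal each" mode returns the index of the first NUL in the chunk, or 16 if there is none. Like the slide's code, this assumes it is safe to read 16 bytes at a time past the terminator; the function name and structure are ours.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpistri (PCMPISTRI) */
#include <stddef.h>
#include <assert.h>

/* strlen via PCMPISTRI: with a zero-length Src1, "equal each" forces a
 * match exactly where Src2's implicit length ends, i.e. at the first NUL. */
__attribute__((target("sse4.2")))
size_t strlen_sttni(const char *src)
{
    const __m128i zero = _mm_setzero_si128();
    size_t off = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + off));
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)
            return off + idx;   /* NUL found within this chunk */
        off += 16;
    }
}
```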
88
CRC32 Preliminary Performance
CRC32 optimized code:

crc32c_sse42_optimized_version(uint32 crc, unsigned char const *p, size_t len)
{   // Assuming len is a multiple of 0x10
    asm("pusha");
    asm("mov %0, %%eax" :: "m" (crc));
    asm("mov %0, %%ebx" :: "m" (p));
    asm("mov %0, %%ecx" :: "m" (len));
    asm("1:");
    // Processing four bytes at a time, unrolled four times:
    asm("crc32 0x0(%ebx), %eax");
    asm("crc32 0x4(%ebx), %eax");
    asm("crc32 0x8(%ebx), %eax");
    asm("crc32 0xc(%ebx), %eax");
    asm("add $0x10, %ebx");
    asm("sub $0x10, %ecx");
    asm("jecxz 2f");
    asm("jmp 1b");
    asm("2:");
    asm("mov %%eax, %0" : "=m" (crc));
    asm("popa");
    return crc;
}

Preliminary tests involved kernel code implementing CRC algorithms commonly used by iSCSI drivers:
• 32-bit and 64-bit versions of the kernel under test
• The 32-bit version processes 4 bytes of data using 1 CRC32 instruction
• The 64-bit version processes 8 bytes of data using 1 CRC32 instruction
• Input strings of sizes 48 bytes and 4KB used for the test

Speedup over the fastest software CRC32C:
                             32-bit    64-bit
Input Data Size = 48 bytes   6.53X     9.85X
Input Data Size = 4 KB       9.3X      18.63X

Preliminary results show the CRC32 instruction out-performing the fastest CRC32C software algorithm by a big margin
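A slow but portable software reference is handy for validating a hardware CRC path. This bit-at-a-time version (structure ours) uses the reflected form, 0x82F63B78, of the same Castagnoli polynomial (11EDC6F41h) that the CRC32 instruction implements; real software fallbacks use table- or slice-by-8 variants.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Reference CRC-32C (Castagnoli): one bit per inner iteration, with the
 * customary pre- and post-inversion of the running CRC. */
uint32_t crc32c_ref(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The standard check value for CRC-32C over the ASCII digits "123456789" is 0xE3069283, which makes a convenient self-test.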
89
Idle Power Matters
• Data center operating costs¹
  – 41M physical servers by 2010, average utilization < 10%
  – $0.50 spent on power and cooling for every $1 spent on server hardware
• Regulatory requirements affect all segments
  – ENERGY STAR* and related requirements
• Environmental responsibility

Idle power consumption is not just a mobile concern

1. IDC's Datacenter Trends Survey, January 2007
90
CPU Core Power Consumption
• High frequency processes are leaky– Reduced via high-K metal gate process,
design technologies, manufacturing optimizations
Leakage
91
CPU Core Power Consumption
• High frequency designs require high performance global clock distribution
• High frequency processes are leaky– Reduced via high-K metal gate process,
design technologies, manufacturing optimizations
Leakage
Clock Distribution
92
CPU Core Power Consumption
• Remaining power in logic, local clocks– Power efficient microarchitecture, good
clock gating minimize waste
• High frequency designs require high performance global clock distribution
• High frequency processes are leaky– Reduced via high-K metal gate process,
design technologies, manufacturing optimizations
Leakage
Clock Distribution
Local Clocks
and Logic
Total Core Power Consumption
Challenge – Minimize power when idle
93
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
•C0: CPU active state
Leakage
Clock Distribution
Local Clocks
and Logic
Active Core Power
94
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
  – Stop core pipeline
  – Stop most core clocks
Leakage
Clock Distribution
Local Clocks
and Logic
Active Core Power
95
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
  – Stop core pipeline
  – Stop most core clocks
• C3 state (mid 1990s):
  – Stop remaining core clocks
Leakage
Clock Distribution
Active Core Power
96
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
  – Stop core pipeline
  – Stop most core clocks
• C3 state (mid 1990s):
  – Stop remaining core clocks
• C4, C5, C6 states (mid 2000s):
  – Drop core voltage, reducing leakage
  – Voltage reduction via shared VR
Leakage
Existing C-states significantly reduce idle power
Active Core Power
97
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• Cores share a single voltage plane
  – All cores must be idle before voltage is reduced
  – Independent VRs per core prohibitive from a cost and form factor perspective
• Deepest C-states have relatively long exit latencies
  – System/VR handshake, ramp voltage, restore state, restart pipeline, etc.
Deepest C-states available in mobile products
98
Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
•C0: CPU active state
Leakage
Clock Distribution
Local Clocks
and Logic
Active Core Power
99
Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
  – Stop core pipeline
  – Stop most core clocks
Leakage
Clock Distribution
Local Clocks
and Logic
Active Core Power
100
Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
  – Stop core pipeline
  – Stop most core clocks
• C3 state:
  – Stop remaining core clocks
Leakage
Clock Distribution
Active Core Power
101
Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
  – Stop core pipeline
  – Stop most core clocks
• C3 state:
  – Stop remaining core clocks
• C6 state:
  – Processor saves architectural state
  – Turn off power gate, eliminating leakage
Leakage
Core idle power goes to ~0
Active Core Power
102
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
103
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline: Core 0 and Core 1 power over time]
Cores running applications.
104
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline continues]
Task completes. No work waiting. OS executes MWAIT(C6) instruction.
105
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline continues]
Execution stops. Core architectural state saved. Core clocks stopped. Core 0 continues execution undisturbed.
106
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline continues]
Task completes. No work waiting. OS executes MWAIT(C6) instruction. Core enters C6.
107
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline continues]
VR voltage reduced. Power drops.
108
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline continues]
Interrupt for Core 1 arrives. VR voltage increased. Core 1 clocks turn on, core state restored, and core resumes execution at instruction following MWAIT(C6). Core 0 remains idle.
109
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Timeline continues]
Interrupt for Core 0 arrives. Core 0 returns to C0 and resumes execution at instruction following MWAIT(C6). Core 1 continues execution undisturbed.

C6 significantly reduces idle power consumption
110
Reducing Platform Idle Power
• Dramatic improvements in CPU idle power increase the importance of platform improvements
• Memory power:
  – Memory clocks stopped between requests at low utilization
  – Memory put into self-refresh in package C3, C6
• Link power:
  – Intel® QuickPath Interconnect links go to lower power states as the CPU becomes less active
  – PCI Express* links on the chipset have similar behavior
• Hint to the VR to reduce phases during periods of low current demand

Intel® Core™ microarchitecture (Nehalem) reduces CPU and platform power
111
Intel® Core™ Microarchitecture (Nehalem): Integrated Power Gate
• Integrated power switch between VR output and core voltage supply
  – Very low on-resistance
  – Very high off-resistance
  – Much faster voltage ramp than external VR
• Enables per-core C6 state
  – Individual cores transition to ~0 power state
  – Transparent to other cores, platform, software, and VR
[Diagram: Core0 through Core3 each behind a power gate on VCC; Memory System, Cache, and I/O on VTT]
Close collaboration with process technology to optimize device characteristics
112
• Intel® Core™ microarchitecture (Nehalem) power management overview
• Minimizing idle power consumption
• Performance when you need it
Agenda
113

Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem)

Workload: lightly threaded
[Charts: Core 0 and Core 1 frequency without Turbo vs. with Core 1's clock stopped]
Clock Stopped: power reduction in inactive cores
114

Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem)

Workload: lightly threaded
[Charts: with Core 1's clock stopped, Turbo Mode raises Core 0's frequency]
Clock Stopped: power reduction in inactive cores
Turbo Mode: in response to the workload, adds additional performance bins within headroom
115

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode

Workload: lightly threaded or < TDP
[Charts: Core 0 through Core 3 frequency without Turbo vs. with two cores power gated]
Power Gating: zero power for inactive cores
116

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode

Workload: lightly threaded or < TDP
[Charts: with Core 2 and Core 3 power gated, Turbo Mode raises the active cores' frequency]
Power Gating: zero power for inactive cores
Turbo Mode: in response to the workload, adds additional performance bins within headroom
117

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode

Workload: lightly threaded or < TDP
[Charts: with two cores power gated, Turbo Mode raises Core 0 and Core 1 further within headroom]
Power Gating: zero power for inactive cores
Turbo Mode: in response to the workload, adds additional performance bins within headroom
118

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode

Workload: lightly threaded or < TDP
[Charts: all four cores active, running workloads < TDP]
Active cores running workloads < TDP
119

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode

Workload: lightly threaded or < TDP
[Charts: all four cores active below TDP; Turbo Mode raises all cores' frequency]
Active cores running workloads < TDP
Turbo Mode: in response to the workload, adds additional performance bins within headroom
120

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode

Workload: lightly threaded or < TDP
[Charts: combined view: power gating gives zero power for inactive cores while Turbo Mode adds performance bins within headroom]
Power Gating: zero power for inactive cores
Turbo Mode: in response to the workload, adds additional performance bins within headroom

Dynamically delivering optimal performance and energy efficiency
121
Additional Sources of Information on This Topic:
• Other Sessions / Chalk Talks / Labs:
– TCHS001: Next Generation Intel® Core™ Microarchitecture (Nehalem) Family of Processors: Screaming Performance, Efficient Power (8/19, 3:00 – 3:50)
– DPTS001: High End Desktop Platform Design Overview for the Next Generation Intel® Microarchitecture (Nehalem) Processor (8/20, 2:40 – 3:30)
– NGMS001: Next Generation Intel® Microarchitecture (Nehalem) Family: Architectural Insights and Power Management (8/19, 4:00 – 5:50)
– NGMC001: Chalk Talk: Next Generation Intel® Microarchitecture (Nehalem) Family (8/19, 5:50 – 6:30)
– NGMS002: Tuning Your Software for the Next Generation Intel® Microarchitecture (Nehalem) Family (8/20, 11:10 – 12:00)
– PWRS003: Power Managing the Virtual Data Center with Windows Server* 2008 / Hyper-V and Next Generation Processor-based Intel® Servers Featuring Intel® Dynamic Power Technology (8/19, 3:00 – 3:50)
– PWRS005: Platform Power Management Options for Intel® Next Generation Server Processor Technology (Tylersburg-EP) (8/21, 1:40 – 2:30)
– SVRS002: Overview of the Intel® QuickPath Interconnect (8/21, 11:10 – 12:00)
122
Session Presentations - PDFs
The PDF for this Session presentation is available from our IDF Content Catalog at the end of the day at:
www.intel.com/idf or
https://intel.wingateweb.com/US08/scheduler/public.jsp
123
Please Fill out the Session Evaluation Form
Place form in evaluation box at the back of session room
Thank you for your input, we use it to improve future Intel Developer Forum events