8/14/2019 Intel Processor Architecture-Core
Intel Core Microarchitecture
Intel Software College
Copyright 2006, Intel Corporation. All rights reserved.
Intel Processor Micro-architecture - Core microarchitecture
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Objectives
After completion of this module you will be able to describe
Components of an IA processor
Working flow of the instruction pipeline
Notable features of the architecture
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Industrial Recognition
PC Format, May 2006: "Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game."
Anandtech, "Intel Regains Performance Crown": "At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we've seen here."
Extremetech, "Intel Reveals Conroe Architecture": "And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit."
HotHardware.com, "Conroe Benchmarks - Intel Showing Big Strength": "Intel is poised to change the face of the desktop computing landscape."
GD Hardware.com, "Intel Dishes the Knockout Punch to AMD with Conroe": "the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session."
Real World Tech, "Intel's Next Generation Microarchitecture Unveiled": "Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry."
Performance Summary
Intel Core Microarchitecture dramatically boosts Intel platform performance
Conroe & Woodcrest drive clear Desktop/Server performance leadership
Merom extends Intel Mobile performance leadership
Intel Core Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
Intel's 3rd generation dual-core (while competition stuck on 1st generation)
New Intel high-performance engine: Wider, Smarter, Faster, More Efficient
The Core Effect: Intel Core Microarchitecture ramp fuels broad roadmap accelerations
Best Processor on the Planet: Energy-Efficient Performance (1)
20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts (1)
(1) Based on SPECint*_rate_base2000
Agenda
Introduction
Knowledge preparation
Architecture VS Microarchitecture
CISC VS RISC
Performance Measurements
Pipeline Design
Power and Energy
Chip Multi-Processing
Notable features
Micro-architecture tour
Coding considerations
Architecture and Micro-architecture
What is Computer Architecture?
Architecture is the set of features which are externally visible:
Instruction set
Registers
Addressing modes
Bus protocols
Intel Architectures (IA)
IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
X87 (Floating Point extension)
MMX (Multi-Media extension)
SSE, SSE2, SSE3 (Streaming SIMD Extensions)
Intel 64/EM64T (64-bit Integer extension of IA32)
IA64 (Intel new 64-bit architecture)
Itanium/Itanium2 processor family
Intel Architecture History
Architecture: instruction set definition and compatibility
Examples: EPIC* (Itanium), IA-32, IXA* (XScale)
Microarchitecture: hardware implementation maintaining instruction set compatibility with the high-level architecture
Examples: P5, P6, Intel NetBurst, Banias
Processors: productized implementation of a microarchitecture
Examples: Pentium, Pentium Pro, Pentium II/III, Pentium 4, Pentium D, Xeon, Pentium M
* IXA: Intel Internet Exchange Architecture; EPIC: Explicitly Parallel Instruction Computing
[Figure: Mobile Microarchitecture + Intel NetBurst + New Innovations feed into the Intel Core Microarchitecture]
Processors: Intel Core 2 Duo/Quad/Extreme processors
RISC Approach to CPU design
Optimize H/W for common basic operations
Fixed instruction length
Shorter Execution Pipeline
Ease of Instruction Level Parallelism
Large number of registers
Fewer memory accesses
Load/Store architecture
Shorter Execution Pipeline
Ease of advancing Loads
Branch Hints
Reduce pipeline flush events
Exotic stuff to be implemented in S/W with minimal H/W support
No complex H/W instructions
Handle exceptional conditions in S/W
Examples: MIPS, IBM Power and PowerPC, Sun Sparc
Achieve maximum performance by right partitioning between H/W and S/W
(RISC = Reduced Instruction Set Computers)
CISC Approach to CPU design
Rich architecture
Variable length instructions.
Complex addressing modes.
On-chip HW / SW partitioning required
H/W keeps executing simple stuff
Complex instructions are emulated using u-code routines from ROM
More instructions treated as simple as more H/W is available
COMPATIBILITY has some major advantages:
Large (and forever increasing) software base
Code development tools
Expertise
H/W - S/W spiral
Example: Intel IA32, Motorola 680X0
Maximize information passed to the HW
(CISC = Complex Instruction Set Computers)
Performance Measurement
Performance is the reciprocal of the Time of execution:
Performance = 1 / Time_of_Execution = 1 / (L * CPI * Tc)
Where:
L = Code Length (# of machine instructions)
CPI = Clock cycles Per Instruction
Tc = Clock period (nSecs)
Substitute:
IPC = Instructions Per Cycle = 1/CPI
F = Frequency = 1/Tc
Performance = (IPC * F) / L
Improve ILP (raises IPC)
Improve Timing (raises F)
Arch Enhancements (reduce L)
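As a worked illustration of the relation between L, CPI, Tc, IPC and F, the Python sketch below evaluates both forms of the equation. The workload numbers are made up for the example, and the function names are hypothetical.

```python
def execution_time(code_length, cpi, clock_period_ns):
    """Time of execution in ns: L * CPI * Tc."""
    return code_length * cpi * clock_period_ns


def performance(code_length, ipc, frequency_ghz):
    """Performance = IPC * F / L (F in GHz, so the result is in 1/ns)."""
    return ipc * frequency_ghz / code_length


# Hypothetical workload: 1,000,000 instructions, CPI = 0.5, Tc = 0.5 ns (2 GHz).
L = 1_000_000
time_ns = execution_time(L, cpi=0.5, clock_period_ns=0.5)
perf = performance(L, ipc=1 / 0.5, frequency_ghz=1 / 0.5)

# The two forms agree: performance is the reciprocal of the execution time.
print(time_ns, perf, perf * time_ns)
```

Doubling IPC, doubling F, or halving L each doubles the computed performance, which is why the three levers are listed separately.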
Performance Measurement (cont.)
Performance considerations:
Which Code/Application to run?
Which OS?
Which other components in the platform?
Under which thermal conditions?
Multithreading? Multiprocessing?
Benchmarks examples
Industry Standard
Spec (ISPEC, FSPEC)
TPC
Commercial
SysMark
MobileMark
PCMark
Sandra
ScienceMark
Applications
Video (Windows Media encoder, DivX)
Audio (Lame MP3)
Compression (RAR)
Content creation (3DSM, Photoshop, Premiere)
Latest Games (Doom III, FarCry, but changes fast)
Specific industries use specific benchmarks
Linux compilation, POVRay, LinPack, lmbench
Design Considerations for Different Market Segments
Constraints:
Desktop: thermally and area constrained
Extreme: unconstrained
Value: very area constrained
Mobile: thermally, energy and area constrained
Servers: thermally and energy constrained
Micro-architecture is the Art of Tradeoffs between:
Schedule
Requirements / Standards
Performance
Features
Power / Energy
Area / Cost
Design Metrics
IPC = Instructions per Cycle
The more the better
Latency (same as Response Time)
The time interval between when any request for data is made and when the data transfer completes
The less the better
Throughput
The amount of work completed by the system per unit of time (ops/sec)
The more the better
CPU Pipeline
Break the work to smaller pieces
Four basic stages of instruction life
Fetch - bring instruction to core
Decode - read operands from register
Execute - perform the operation
Writeback - save result to register
Execution timing of simple instructions (legend: op src1, src2 -> dst)
add eax, ebx -> eax : F D E W
sub ecx, edx -> ecx : F D E W
Increased throughput means an increased number of completed instructions per cycle
Pipeline Design - Explore Parallelism
A new instruction does not always depend on the previous one
Can start a new instruction before the previous one is finished
...if different stages use different H/W resources
Run instructions in parallel (pipeline)
add eax, ebx -> eax : F D E W
sub ecx, edx -> ecx :   F D E W
or edi, esi -> edi :      F D E W
Need to balance pipe stages
Each stage should take the same time for best throughput and utilization
Clock cycle is determined by the longest path!
[Figure: Fetch / Decode / Exec / WB stage boxes repeated for four instructions]
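The effect of stage balance on the clock cycle can be sketched numerically. The Python model below assumes an ideal pipeline with no stalls, uses arbitrary time units for the stage delays, and uses hypothetical function names.

```python
def unpipelined_time(stage_delays, n_instructions):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * sum(stage_delays)


def pipelined_time(stage_delays, n_instructions):
    """Ideal pipeline, no stalls: the clock period equals the slowest stage."""
    clock = max(stage_delays)
    n_stages = len(stage_delays)
    # The first instruction needs n_stages cycles; each later one
    # completes one cycle after its predecessor.
    return (n_stages + n_instructions - 1) * clock


balanced = [1, 1, 1, 1]    # Fetch, Decode, Exec, WB delays (arbitrary units)
unbalanced = [1, 1, 2, 1]  # a slow Exec stage stretches every cycle

for delays in (balanced, unbalanced):
    print(delays, unpipelined_time(delays, 3), pipelined_time(delays, 3))
```

With three instructions, the balanced pipeline finishes in 6 time units against 12 unpipelined. The unbalanced one needs 12 against 15, because every cycle is stretched to the longest stage.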
Pipeline Design - Fighting Stalls
Data flow dependency (instructions output/input)
Solved by bypasses, renaming etc
Control flow dependencies
Solved by branch prediction
Others (Cache misses, long latency instructions)
Solved by other dynamic scheduling techniques
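To make the data flow dependency case concrete, the Python sketch below scans a short instruction list for read-after-write dependencies, that is, one instruction's output feeding a later instruction's input. The tuple layout mirrors the "op src1, src2 -> dst" legend. The third instruction and the function name are hypothetical, added only for illustration.

```python
def raw_dependencies(program):
    """Find read-after-write dependencies.

    program is a list of (op, sources, dest) tuples.
    Returns (producer_index, consumer_index, register) triples.
    """
    last_writer = {}
    deps = []
    for i, (_op, sources, dest) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps


program = [
    ("add", ("eax", "ebx"), "eax"),
    ("sub", ("ecx", "edx"), "ecx"),
    ("or", ("edi", "eax"), "edi"),  # hypothetical: reads eax produced by the add
]
print(raw_dependencies(program))  # [(0, 2, 'eax')]
```

In a real pipeline, such a dependency is what a bypass path resolves without waiting for writeback.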
Race of CISC vs. RISC
In modern CPUs, advanced micro-architecture techniques minimize the advantages of RISC over CISC
Branch Prediction
Reduces the effect of extra pipeline stages
Register Renaming
Effectively Increase the Number of Registers
Out Of Order
Reduce Number of stalls caused by shortage of registers
Speculative Execution
Further Reduce Number of stalls
Power saving features
Reduce the overhead when not needed.
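Register renaming can be sketched in a few lines. The Python model below is a simplified illustration, not a description of the actual hardware: every destination gets a fresh physical name (p0, p1, ...), so a later write to the same architectural register no longer conflicts with earlier readers. The program and all names are hypothetical.

```python
def rename(program, arch_regs):
    """Rename each destination to a fresh physical register (p0, p1, ...)."""
    mapping = {r: r for r in arch_regs}  # architectural register -> current name
    renamed = []
    for counter, (op, sources, dest) in enumerate(program):
        new_sources = tuple(mapping[s] for s in sources)
        new_dest = f"p{counter}"
        mapping[dest] = new_dest
        renamed.append((op, new_sources, new_dest))
    return renamed


program = [
    ("mov", ("ebx",), "eax"),
    ("add", ("eax", "ecx"), "edx"),  # reads eax written by the first mov
    ("mov", ("esi",), "eax"),        # reuses eax: a false dependency before renaming
]
for instr in rename(program, ["eax", "ebx", "ecx", "edx", "esi"]):
    print(instr)
```

After renaming, the second `mov` writes p2 while the `add` still reads p0, so the two can be reordered freely. Only the true dependency from the first `mov` to the `add` remains.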
uop - Intel's Take on the CISC/RISC Race
(CISC) instructions are translated into one or more (RISC) uops (micro-operations)
Fixed format
Wide and simple
Temp registers
Usually one uop per instruction
Complex instructions can be thousands of uops
Stores divided into two uops (STA and STD)
Fusion plays games here
Power and Energy
Maximum power (TDP):
Cooling requirements
Cooling solution
Computer form factor and acoustic noise
Average power
Battery life
Electricity bill
General calculation:
P = frequency * voltage^2 * activity factor * capacitance + leakage
Reducing TDP
Fewer transistors and wires
Smaller transistors and wires
Power features: less activity
Low leakage transistors
Reducing average power
Energy efficiency
Power states
Lower leakage
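The general calculation can be tried out numerically. The Python sketch below uses made-up values in arbitrary but consistent units, and it holds leakage constant as a simplifying assumption. The function name is hypothetical.

```python
def power(frequency, voltage, activity, capacitance, leakage):
    """P = frequency * voltage^2 * activity factor * capacitance + leakage."""
    return frequency * voltage ** 2 * activity * capacitance + leakage


# Illustrative operating point (all values are made up).
base = power(frequency=1.0, voltage=1.0, activity=0.5, capacitance=1.0, leakage=0.1)

# Scale frequency and voltage down by 20% each; leakage is held constant
# here for simplicity.
scaled = power(frequency=0.8, voltage=0.8, activity=0.5, capacitance=1.0, leakage=0.1)

print(base, scaled)
```

Because voltage enters squared, lowering frequency and voltage together by 20% cuts the dynamic term to about half (0.8 cubed is 0.512). That is the reasoning behind voltage-frequency scaling as a power feature.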
Dual/Multi Core and SMT
Put more than one core per package
Architectural change:
Software must be multi-threaded or multi-process
but backward compatible with multiprocessor systems (MP)
Several ways of implementing it
All of them being used
[Figure: alternative multi-core package layouts built from Core, LLC and I/O blocks]
SMT: Run two (or more) threads on the same core, simultaneously
Intel Approach
While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread-level parallelism.
[Figure: timeline of hardware threads per processor, showing State, Execution Units, Cache and Bus: Intel Pentium, 1 thread (Q4 2000); Intel Pentium with HT, 2 threads (Q2 2003); Intel Pentium D Processor, 2 threads (Q2 2005); Intel Core 2 Duo, 2 threads (Q3 2006); Intel XQ6700*, 4 threads (Q4 2006); 80 threads (?)]
An Acronym Cheat Sheet of Parallel Computing
CMP: Chip Multi Processor (two or more cores per package)
Dual Core: two cores in same package
Quad Core: four cores in same package
DP: Dual Processor (two packages)
MP: Multi Processor (four or more packages)
SMT: Simultaneous Multi Threading (virtual multi core: HyperThreading)
Agenda
Introduction
Knowledge preparation
Notable features
Wide Dynamic Execution
Smart Memory Access
Advanced Smart Cache
Advanced Digital Media Boost
Intelligent Power Capability
Micro-architecture tour
Coding considerations
Intel Core Micro-architecture Notable Features
Intel Wide Dynamic Execution
14-stage efficient pipeline
Wider execution path
Advanced branch prediction
Macro-fusion
Roughly ~15% of all instructions are conditional branches
Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
Micro-fusion
Merges the load and operation micro-ops into one fused micro-op
64-Bit Support
Merom, Conroe, and Woodcrest support EM64T
[Figure: block diagram. Instruction Fetch and PreDecode, Instruction Queue, Decode (with uCode ROM), Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers; execution ports: ALU / Branch / MMX/SSE / FPmove, ALU / FAdd / MMX/SSE / FPmove, ALU / FMul / MMX/SSE / FPmove, Load, Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s; path-width labels 4, 4 and 5]
Intel Core Micro-architecture Notable Features (cont.)
Intel Advanced Memory Access
Improved prefetching
Memory disambiguation
Advance load before a possible data dependency (pointer conflict)
Earlier loads hide memory latencies
Intel Core Micro-architecture Notable Features (cont.)
Intel Advanced Smart Cache
Multi-core optimization
Shared between the two cores
Advanced Transfer Cache architecture
Reduced bus traffic
Both cores have full access to the entire cache
Dynamic Cache sizing
Intel Core Micro-architecture Notable Features (cont.)
Advantages of Shared Cache
[Figure: CPU1 and CPU2 exchanging a cache line over the Front Side Bus (FSB), with memory attached]
Shipping an L2 cache line costs roughly half an access to memory
Intel Core Micro-architecture Notable Features (cont.)
Advantages of Shared Cache (cont.)
[Figure: CPU1 and CPU2, cache line, Front Side Bus (FSB), memory]
L2 is shared:
No need to ship cache line
Intel Core Micro-architecture Notable Features (cont.)
Intel Advanced Digital Media Boost
Single Cycle SIMD Operation
8 Single Precision Flops/cycle
4 Double Precision Flops/cycle
Wide Operations
128-bit packed Add
128-bit packed Multiply
128-bit packed Load
128-bit packed Store
Support for Intel EM64T instructions
[Figure: SIMD operation (SSE/SSE2/SSE3/SSSE3) on 128-bit operands, bits 127..0. SOURCE X4 X3 X2 X1 and Y4 Y3 Y2 Y1 give DEST X4opY4, X3opY3, X2opY2, X1opY1. Previous designs use clock cycle 1 and clock cycle 2; Core arch completes in clock cycle 1]
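The flops-per-cycle figures can be reproduced with simple arithmetic. The Python sketch below assumes that one packed add and one packed multiply unit can each start a full-width 128-bit operation every cycle. That assumption and the function name are illustrative, not taken from the slide.

```python
def flops_per_cycle(vector_bits, element_bits, fp_units):
    """Peak flops per cycle if each unit completes one packed op per cycle."""
    lanes = vector_bits // element_bits
    return lanes * fp_units


# Assumption: two floating-point units (one add, one multiply), 128 bits wide.
print(flops_per_cycle(128, 32, fp_units=2))  # 8 single precision
print(flops_per_cycle(128, 64, fp_units=2))  # 4 double precision

# A design that splits each 128-bit op over two cycles behaves like a
# 64-bit-wide unit in this model.
print(flops_per_cycle(64, 32, fp_units=2))   # 4
```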
Intel Core Micro-architecture Notable Features
Intel Advanced Digital Media Boost
Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)
16 new packed integer instructions
Targeting video encode/decode
Significantly improved strings
REP MOVS and REP STOS ~8 bytes / cycle throughput
mileage may vary
Intel Core Micro-architecture Notable Features
Intel Advanced Digital Media Boost
Supplemental SSE-3 (SSSE-3)
Packed SIGN: PSIGNB/W/D
Packed Shuffle Bytes: PSHUFB
Packed Multiply High with Round and Scale: PMULHRSW
Multiply and Add Packed Signed/Unsigned bytes: PMADDUBSW
Packed Align Right: PALIGNR
Packed Absolute Values: PABSB, PABSW, PABSD
Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
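To give a feel for what one of these instructions does, the Python sketch below emulates the byte-selection rule of the 128-bit form of PSHUFB on plain lists of byte values. It is a software model for illustration only, and the function name is hypothetical.

```python
def pshufb(src, mask):
    """Emulate a 128-bit byte shuffle; src and mask each hold 16 byte values."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("expected 16 bytes")
    out = []
    for m in mask:
        if m & 0x80:
            out.append(0)              # high bit set: result byte is zeroed
        else:
            out.append(src[m & 0x0F])  # low 4 bits select a source byte
    return out


data = list(range(0x10, 0x20))     # bytes 0x10..0x1F
reverse = list(range(15, -1, -1))  # mask that reverses the byte order
print(pshufb(data, reverse))
```

A single shuffle with a suitable mask can reverse, broadcast or zero bytes, which is useful in video and pixel-format code.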
Intel Core Micro-architecture Notable Features (cont.)
Intelligent Power Capability
Advanced power gating & Dynamic power coordination
Multi-point demand-based switching
Voltage-Frequency switching separation
Supports transitions to deeper sleep modes
Event blocking
Clock partitioning and recovery
Dynamic Bus Parking
During periods of high performance execution, many parts of the chip core can be shut off
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Front End
Out-Of-Order Execution Core
Memory Sub-system
Coding considerations
Intel Core Micro-architecture Drill-down
[Figure: block diagram. Front end: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS. Out-of-order core: register alias table, ALLOC, Re-Order Buffer, Reservation Station, execution ports for integer / FP / SIMD (3x), load, store address, store data. Memory: memory order buffer, data cache unit, page miss handler]
Core Micro-architecture Front End
Instruction preparation before execution
Instruction Fetch Unit
Instruction Queue
Instruction Decode Unit
Branch Prediction Unit
[Figure: front-end blocks: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]
Intel Core Microarchitecture Front End
Instruction Queue
Buffer between instruction pre-decode unit and decoder
Up to six predecoded instructions written per cycle
18 instructions contained in IQ
Up to 5 instructions read from IQ
Potential loop cache
Loop Stream Detector (LSD) support
Re-use of decoded instruction
Potential power saving
Macro-Fusion
Roughly 15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode
[Diagram: fused cmpjae eax, [mem], label passes through Scheduler, Execution, and Branch Eval; flags and target go to Write back]
Macro-Fusion Absent
Read four instructions from Instruction Queue
Each instruction gets decoded into separate uops
[Diagram: the Instruction Queue feeds decoders dec0 to dec3 over Cycle 1 and Cycle 2. Instructions shown: mulps xmm0, xmm0; addps xmm0, [EAX+16]; movps [EAX+240], xmm0; cmp eax, 100000; jge label]
Enabling Example
for (int i=0; i
Macro-Fusion Present
Read five instructions from Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions
[Diagram: the Instruction Queue feeds decoders dec0 to dec3 in Cycle 1. Instructions shown: mulps xmm0, xmm0; addps xmm0, [EAX+16]; movps [EAX+240], xmm0; fused cmpjae eax, 100000, label]
Enabling Example
for (unsigned int i=0;i
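The enabling example is truncated in the source, so what follows is only a minimal sketch of the idea, not the slide's own code. It assumes a compiler that tests an unsigned loop counter with an unsigned compare and branch (cmp + jae/jb), the kind of pair shown fused as cmpjae. The function name sum_first_n and the bound n are hypothetical. Whether fusion actually happens depends on the generated instructions, so the assembly output should be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums the first n floats of a.
 * The unsigned counter lets the compiler test the loop bound with an
 * unsigned compare and branch; a signed counter would typically produce
 * a signed branch such as jge instead (the "absent" case on the slides). */
float sum_first_n(const float *a, unsigned int n)
{
    assert(a != NULL || n == 0);
    float s = 0.0f;
    for (unsigned int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

Changing the counter type is a source-level hint only; the compiler remains free to pick a different instruction sequence.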
Instruction Decode / Micro-Op Fusion
Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation
Micro-op fusion effectively widens the pipeline
Instruction Decode / Micro-Fusion (cont.)
u-ops of a Store: movps [EAX+240], xmm0
Separate u-ops: sta eax+240 and std xmm0, [eax+240]
Fused u-op: st xmm0, [eax+240]
Branch Prediction Improvements
Intel Pentium 4 Processor branch prediction PLUS the following two improvements:
Indirect Branch Predictor
Loop Detector
Branch mispredictions reduced by >20%
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Front End
Out-Of-Order Execution Core
Memory Sub-system
Coding considerations
Core Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
Renamer
Reservation station (RS)
Issue ports
Execution Unit
[Diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station; ports for integer / FP / SIMD (3x), load, store address, store data]
Execution Core Building Blocks
Renamer
RS
ROB
Execution Unit | Ports (number)
Floating Point | 0,1,5
Integer | 0,1,5
SIMD Integer, SIMD/Integer MUL | 0,1,5
Memory Sub-system | 2 Load; 3,4 Store
Issue Ports and Execution Units
6 dispatch ports from RS
3 execution ports (shared for integer / fp / simd)
load
store (address)
store (data)
128-bit SSE implementation
Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
Port 1 has packed add (3 cycles, all precisions)
Retirement Unit
ReOrder Buffer (ROB)
Holds micro-ops in various stages of completion
Buffers completed micro-ops
Updates the architectural state in order
Manages ordering of exceptions
[Diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station]
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Front End
Out-Of-Order Execution Core
Memory Sub-system
Coding considerations
Core Micro-architecture Memory Sub-System
Memory Ordering Buffer
Store Address Buffer
Stores the address of each store not actually performed
Loads compare address to any store older than itself
If it finds a hole
Store Data Buffer
Stores data of each store not actually performed
If a load hits in the SAB, the data is forwarded from here
Load Buffer
Stores address of non-retired loads
For snoops and re-dispatch
One 128-bit load and one 128-bit store per cycle to different memory locations
Out of order memory operations
Core Micro-architecture Memory Sub-System (cont.)
32k D-Cache (8-way, 64 byte line size)
Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache
Cache to cache transfer
improves producer / consumer style MP
Wider interface to L2
reduced interference
processor line fill is 2 cycles
Higher bandwidth from the L2 cache to the core
~14 clock latency and 2 clock throughput
Load & Store Access order
1. L1 cache of immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory
[Diagram: Core1 and Core2, 2 MB L2 Cache, Bus]
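The D-cache geometry above (32 KB, 8-way, 64-byte lines) implies 32768 / (8 * 64) = 64 sets. The sketch below works through that arithmetic. It assumes the common mapping where the set index is taken from the address bits just above the line offset; the slides do not state the mapping, and the names l1d_set_index and l1d_line_offset are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    L1D_SIZE_BYTES = 32 * 1024,  /* 32k D-Cache */
    L1D_WAYS       = 8,          /* 8-way */
    L1D_LINE_BYTES = 64,         /* 64 byte line size */
    L1D_SETS       = L1D_SIZE_BYTES / (L1D_WAYS * L1D_LINE_BYTES)
};

static_assert(L1D_SETS == 64, "32 KB / (8 ways * 64 B) = 64 sets");

/* Assumed mapping: set index = line number modulo the number of sets. */
unsigned l1d_set_index(uint64_t addr)
{
    return (unsigned)((addr / L1D_LINE_BYTES) % L1D_SETS);
}

/* Byte position of addr within its 64-byte line. */
unsigned l1d_line_offset(uint64_t addr)
{
    return (unsigned)(addr % L1D_LINE_BYTES);
}
```

Under this assumption, addresses 4 KB apart map to the same set, so more than eight such lines in active use would compete for the eight ways.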
Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)
L1D cache prefetching
Data Cache Unit Prefetcher
Known as the streaming prefetcher
Recognizes ascending access patterns in recently loaded data
Prefetches the next line into the processor's cache
Instruction Based Stride Prefetcher
Prefetches based upon a load having a regular stride
Can prefetch forward or backward 2 Kbytes (1/2 default page size)
L2 cache prefetching: Data Prefetch Logic (DPL)
Prefetches data to the 2nd level cache before the DCU requests the data
Maintains 2 tables for tracking loads
Upstream: 16 entries
Downstream: 4 entries
Every load is either found in the DPL or generates a new entry
Upon recognition of the 2nd load of a stream the DPL will prefetch the next load
Advanced Memory Access / Memory Disambiguation
Memory Disambiguation predictor
Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
Extension to existing coherency mechanism
Invisible to software and system
Advanced Memory Access / Memory Disambiguation Absent
Load4 must WAIT until previous stores complete
[Diagram: operations Store1 Y, Load2 Y, Store3 W, Load4 X; Memory holds Data W, Data X, Data Y, Data Z]
Advanced Memory Access / Stores Forwarding
If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load
[Diagram: Store1 Y, Load2 Y, Internal Buffers, Memory (Data Y)]
Advanced Memory Access / Stores Forwarding: Aligned Store Cases
[Figure: for each aligned store, the loads drawn beneath it]
store 128 bit: 1 x load 128 bit, 2 x load 64 bit, 4 x load 32 bit, 8 x load 16, 16 x ld 8
store 64 bit: 1 x load 64 bit, 2 x load 32 bit, 4 x load 16, 8 x ld 8
store 32 bit: 1 x load 32 bit, 2 x load 16, 4 x ld 8
store 16: 1 x load 16, 2 x ld 8
Advanced Memory Access / Stores Forwarding: Unaligned Cases
Note that unaligned store forward does not occur when the load crosses a cache line boundary
[Figure: unaligned store 64 bit, store 32 bit, and store 16 cases, each drawn with the ld 8, load 16, load 32 bit, and load 64 bit accesses beneath it. Legend: store forwarded to load; no forwarding if the load crosses a cache line boundary]
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding
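The cache line rule above can be checked in software before choosing a data layout. The following is a minimal sketch, assuming the 64-byte line size given for the D-cache; the helper name load_crosses_cache_line is hypothetical. It covers only the line-crossing condition, not the full set of forwarding cases in the figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* line size taken from the D-cache slide */

/* Returns true when an access of `size` bytes starting at `addr`
 * touches two cache lines, the case where forwarding does not occur. */
bool load_crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_BYTES);
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

For example, an 8-byte load at offset 56 of a line stays inside it, while the same load at offset 60 spills into the next line.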
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Optimizing for Instruction Fetch and PreDecode
Avoid Length Changing Prefixes (LCPs)
Affects instructions with immediate data or offset
Operand Size Override (66H)
Address Size Override (67H) [obsolete]
LCPs change the length decoding algorithm, increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
The REX (EM64T) prefix (4xH) is not an LCP
The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
Optimizing for Instruction Queue
Includes a Loop Stream Detector (LSD)
Potentially very high bandwidth instruction streaming
A number of requirements to make use of the LSD
Maximum of 18 instructions in up to four 16-byte packets
No RET instructions (hence, little practical use for CALLs)
Up to four taken branches allowed
Most effective at 70+ iterations
LSD is after PreDecode so there is no added cost for LCPs
Trade-off LSD with conventional loop unrolling
Optimizing for Decode
Decoder issues up to 4 uOps for renaming/allocation per clock
This creates a trade-off between more complex instruction uOps versus multiple simple instruction uOps
For example, a single four uOp instruction is all that can be renamed/allocated in a single clock
In some cases, multiple simple instructions may be a better choice than a single complex instruction
Single uOp instructions allow more decoder flexibility
For example, 4-1-1-1 can be decoded in one clock
However, 2-2-2-1 takes three clocks to decode
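The two decode patterns above can be reproduced with a toy model. This is a sketch under stated assumptions, not a description of the real hardware: it assumes decoder 0 accepts an instruction of up to four uOps, decoders 1 to 3 accept only single-uOp instructions, and instructions are decoded strictly in program order. The function name decode_clocks is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* uops[i] is the uOp count of instruction i (1..4) in program order.
 * Returns the number of clocks the assumed 4-1-1-1 decoder group needs. */
unsigned decode_clocks(const unsigned *uops, size_t n)
{
    unsigned clocks = 0;
    size_t i = 0;
    while (i < n) {
        assert(uops[i] >= 1 && uops[i] <= 4);
        ++i;  /* decoder 0 takes the next instruction, simple or complex */
        /* decoders 1..3 take following instructions only if single-uOp */
        for (int d = 1; d <= 3 && i < n && uops[i] == 1; ++d)
            ++i;
        ++clocks;
    }
    return clocks;
}
```

In this model the sequence 4-1-1-1 fills all four decoders in one clock, while in 2-2-2-1 each two-uOp instruction has to wait for decoder 0, giving three clocks.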
Optimizing for Execution
Up to six uOps can be dispatched per clock
Store Data and Store Address dispatch ports are combined on the block diagram
Up to four results can be written back per clock
Single clock latency operations are best
Differing latency operations can create writeback conflicts
Separate multiple-clock uOps with several single uOp instructions
Typical instructions here: ADC/SBB, RMW, CMOVcc
In some cases, separating a RMW instruction into its pieces might be faster (decode and scheduling flexibility)
When equivalent, PS preferred to PD (LCP)
For example, MOVAPS over MOVAPD, XORPS over XORPD
Optimizing for Execution (cont.)
Bypass register access preferred to register reads
Partial register accesses often lead to stalls
Register size access that conflicts with recent previous register write
Partial XMM updates subject to dependency delays
Partial flag stalls can occur too, at much higher cost
Use TEST instruction between shift and conditional to prevent the stall
Common zeroing instructions (e.g., XOR reg,reg) don't stall
Avoid bypass between execution domains
For example: FP (ADDPS) and logical ops (PAND) on XMMn
Vectorization: careful packing/unpacking sequence
Use MXCSR's FZ and DAZ controls as appropriate
Optimizing for Memory
Software prefetch instructions
Can reach beyond a page boundary (including page walk)
Prefetches only when it completes without an exception
General techniques to help these prefetchers
Organize data in consecutive lines
In general, increasing addresses are more easily prefetched
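A software prefetch over consecutive, ascending data might look like the sketch below. It assumes a GCC or Clang compiler, whose __builtin_prefetch builtin is used here; on other compilers the hint is simply omitted. The function name sum_with_prefetch and the distance PREFETCH_AHEAD are hypothetical, and the right distance depends on the workload and has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in elements; tune by measurement. */
#define PREFETCH_AHEAD 16

/* Sums n ints while hinting the cache about data needed a few
 * iterations ahead. The array is walked with increasing addresses. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0 /* read */, 3 /* high locality */);
#endif
        total += data[i];
    }
    return total;
}
```

For a plain ascending walk like this one the hardware prefetchers may already cover the access pattern, so the explicit hint is mainly of interest for less regular traversals.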
Summary
What has been covered
Notable features of Core Micro-architecture
Wide Dynamic Execution
Advanced Memory Access
Advanced Smart Cache
Advanced Digital Media Boost
Power Efficient Support
Core Micro-architecture components
Front End
OOO execution core
Memory sub-system
Platform
Intel provides most of the silicon on any computer
Classical platform partition
CPU: Computation
MCH: high speed IO
ICH: low speed IO
Graphics speed and memory latencies will require different partition
This presentation focuses on the core microarchitecture
[Diagram labels: CPU (Core, Core, LLC), FSB, MCH, Graphics, MEM, DDR, PEG, PCIe, Display, Analog, TV out, DMI, ICH, ME, HD video, PCI (IO), SATA, USB, KBRD, Wireless, Legacy & Debug I/O, others]
Intel 64 = Extending IA-32 to 64 Bit
Added to Intel XEON and Pentium 4 Processor in 2004; today available in all main stream Intel IA-32 processors, in particular in all processors based on Intel Core Architecture
Additional Registers: 8-SSE & 8-Gen Purpose
Double Precision (64-bit) Integer Support
Extended Memory Addressability: 64-Bit Pointers, Registers
[Diagram: the items above are joined by + and = into "With 64-Bit Extension Technology"]
Intel 64 - New Modes of Operation
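Mode | OS Reqd | Compile required | Default Addr Size | Default Operand Size | New Regs | RIP Rel. | 64-bit IP | GPR Width
Legacy Mode (IA32 Mode) | Legacy 32-bit or 16-bit OS | No | 32 | 32 | No | No | No | 32
Long Mode: Compatibility Mode | New 64-bit OS | No | 32 | 32 | No | No | Yes | 32
Long Mode: 64-bit Mode | New 64-bit OS | Yes | 64 | 32 | Yes | Yes | Yes | 64
The slide also lists 16 for the 16-bit variants of Legacy Mode and Compatibility Mode.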
Registers: Extensions and Additions
[Figure: general purpose registers EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP (bits 31..0) extended to RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP (bits 63..0), plus R8 to R15; EIP extended to RIP; SSE registers XMM0 to XMM15 (bits 127..0); X87/MMX registers (bits 79..0)]
Registers: Availability in different modes
64-bit Mode of Operation
Default data size is 32-bits
Override to 64-bits using new REX prefix
All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable
REX prefixes
A family of 16 prefixes, encoded 0x40-0x4F
Allows the use of general purpose registers as 64-bits
Allows the use of new registers (like r8-r15)
Instructions that set a 32 bit register automatically zero extend the upper 32-bits
REX Prefix
A new instruction-prefix byte used in 64-bit mode
Specify the new GPRs and SSE registers
Specify a 64-bit operand size.
Specify extended control registers (used by system software)
An instruction can only have one REX prefix and, if used, it must immediately precede the opcode or the two-byte opcode escape prefix.
The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
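The 16 REX encodings (0x40 to 0x4F) come from a fixed high nibble 0100 followed by four flag bits, usually written W, R, X, B: W selects a 64-bit operand size, and R, X, B extend the register fields so the new registers can be named. The sketch below only builds the prefix byte; the function name rex_prefix is hypothetical and no full instruction encoder is implied.

```c
#include <assert.h>
#include <stdint.h>

/* Builds a REX prefix byte with layout 0100WRXB.
 * w: 1 selects 64-bit operand size
 * r: extends the ModRM reg field
 * x: extends the SIB index field
 * b: extends the ModRM r/m, SIB base, or opcode register field */
uint8_t rex_prefix(unsigned w, unsigned r, unsigned x, unsigned b)
{
    assert(w <= 1 && r <= 1 && x <= 1 && b <= 1);
    return (uint8_t)(0x40u | (w << 3) | (r << 2) | (x << 1) | b);
}
```

With only W set the result is 0x48, the prefix that promotes a default 32-bit operation to 64 bits.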
Physical and Linear Addressing
Linear Addressing
Initial Intel 64 implementations support 48 bits of virtual addressing.
Addresses are required to be in canonical form: bits 47 thru 63 must all be 1 or all be 0.
Physical Addressing
Initial Netburst Intel 64 implementations support 36 bits; today all current processors support at least 40 bits
Entries in page tables expanded for up to 52 bits of physical address.
Intel 64 - Large Memory Considerations
Canonical addressing for 64 bit addresses
Although the architecture now allows calculating flat addresses to 64 bits, today's processors limit virtual addressing to 48 bits
Canonical address definition: An address that has address bits 63 through 47 set to either all ones or all zeros
Canonical addresses are a requirement
Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
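The canonical address definition translates directly into a bit test. The following is a minimal sketch for the 48-bit virtual addressing case described above; the function name is_canonical_48 is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bits 63 through 47 of addr are all zeros or all ones. */
bool is_canonical_48(uint64_t addr)
{
    uint64_t upper = addr >> 47;  /* the 17 bits 63..47 */
    return upper == 0 || upper == 0x1FFFFu;
}
```

The highest canonical address in the lower half is 0x00007FFFFFFFFFFF and the lowest in the upper half is 0xFFFF800000000000; everything between them is non-canonical.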
Introducing SIMD: Single Instruction Multiple Data
Scalar processing
traditional mode
one operation produces one result
[Figure: X + Y gives X + Y]
SIMD processing
with SSE / SSE2
one operation produces multiple results
[Figure: (x3, x2, x1, x0) + (y3, y2, y1, y0) gives (x3+y3, x2+y2, x1+y1, x0+y0)]
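The four-lane addition in the SIMD figure can be sketched with SSE intrinsics. This is an illustration under assumptions: it uses the _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps intrinsics from xmmintrin.h when the compiler defines __SSE__, and falls back to a scalar loop otherwise. The function names add4 and add4_lane are hypothetical; add4_lane exists only to make single lanes easy to check.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Adds four packed floats: out[i] = x[i] + y[i] for i = 0..3. */
void add4(const float *x, const float *y, float *out)
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);             /* x3 x2 x1 x0 */
    __m128 vy = _mm_loadu_ps(y);             /* y3 y2 y1 y0 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  /* one packed add, four results */
#else
    for (size_t i = 0; i < 4; ++i)           /* scalar: one result per add */
        out[i] = x[i] + y[i];
#endif
}

/* Convenience wrapper: returns one lane (0..3) of the packed sum. */
float add4_lane(const float *x, const float *y, int lane)
{
    float out[4];
    add4(x, y, out);
    return out[lane];
}
```

The unaligned load and store forms are used so the sketch works for any float pointer; aligned variants exist for data known to be 16-byte aligned.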
X86 Register Sets
SSE-Registers introduced first in Pentium 3
IA-INT Registers (32 bits wide: eax ... edi)
Fourteen 32-bit registers
Scalar data & addresses
Direct access to regs
MMX Technology / IA-FP Registers (80/64 bits wide: st0 ... st7, mm0 ... mm7)
Eight 80/64-bit registers
Hold data only
Stack access to FP0..FP7
Direct access to MM0..MM7
No MMX Technology / FP interoperability
SSE Registers (128 bits wide: xmm0 ... xmm7)
Eight 128-bit registers
Hold data only:
4 x single FP numbers
2 x double FP numbers
128-bit packed integers
Direct access to the registers
Use simultaneously with FP / MMX Technology
Instruction Set Extensions
Beginning in 2008: ~50 new instructions in 13 groups
All function in 32-bit and 64-bit modes
Improvements in commercial data integrity (i-SCSI), video processing, string and text processing, 2D & 3D imaging, and vectorizing compiler performance
New Instructions Added to Intel Processors (bar chart)

  Date     Extension                                        New instructions   Process (nm)
  Jan-97   MMX                                              56                 350
  Feb-99   Streaming SIMD Extensions (SSE)                  70                 250
  Dec-00   Streaming SIMD Extensions 2 (SSE2)               144                180
  Feb-04   Streaming SIMD Extensions 3 (SSE3)               13                 90
  Jul-06   Supplemental SSE3 (SSSE3)                        32                 65
  2008+    Future Intel instruction set extensions (SSE-4)  50                 45

  Chart annotations: SSE-4 at 45 nm; ~32 marked as Future.
SSE and SSE-2 Data Types
SSE
  4x floats
SSE-2
  16x bytes
  8x 16-bit shorts
  4x 32-bit integers
  2x 64-bit integers
  1x 128-bit(!) integer
  2x doubles
SSE Instruction Set Extensions
2001 PTE Engineering Enabling Conference
Introduced by the Pentium 3 in 1999; now frequently called SSE-1
Only new data type supported: 4 x 32-bit (single precision) floating point data
Some 70 instructions
Arithmetic, compare, and convert operations on SSE SP FP data (packed and unpacked)
Data load/store, prefetch
Extension of MMX
Streaming store (store without using the cache in between)
SSE Sample: Branch Removal
R = (A
SSE-2: introduced by the Intel Pentium 4 processor in 2000
Some 140 new instructions
Added double precision floating point data (2 x 64-bit) and all related instructions, including conversion
Again some extensions to MMX
Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations
SIMD Single vs. SIMD Double
4 x Single Precision: SSE-1
  SIMD SP FP operand = 4 elements (bits 127..0): X3 X2 X1 X0
  Element = SP FP number: S (bit 31) | Exponent (bits 30..23) | Significand (bits 22..0)

2 x Double Precision: SSE-2
  SIMD DP FP operand = 2 elements (bits 127..0): X1 X0
  Element = DP FP number: S (bit 63) | Exponent (bits 62..52) | Significand (bits 51..0)
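The bit positions of the two element formats can be checked with a small C sketch that splits a float and a double into sign, exponent, and significand. The struct and function names (sp_fields, dp_fields, split_sp, split_dp) are hypothetical, and the sketch assumes the usual IEEE 754 representation of float and double.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t sign, exponent, significand; } sp_fields;
typedef struct { uint64_t sign, exponent, significand; } dp_fields;

/* Single precision: S = bit 31, exponent = bits 30..23, significand = bits 22..0. */
sp_fields split_sp(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits without aliasing issues */
    sp_fields r;
    r.sign        = bits >> 31;
    r.exponent    = (bits >> 23) & 0xFFu;
    r.significand = bits & ((UINT32_C(1) << 23) - 1);
    return r;
}

/* Double precision: S = bit 63, exponent = bits 62..52, significand = bits 51..0. */
dp_fields split_dp(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    dp_fields r;
    r.sign        = bits >> 63;
    r.exponent    = (bits >> 52) & 0x7FFu;
    r.significand = bits & ((UINT64_C(1) << 52) - 1);
    return r;
}
```

For example, 1.0f has a biased exponent of 127 and a zero significand, while 1.0 as a double has a biased exponent of 1023.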
Sample for SSE-2: SIMD Double <-> SIMD Int Conversion
SIMD Double -> SIMD Int: conversion to two lower ints, two higher ints cleared
  x:   x1 | x0
  ix:  0 | 0 | (int)x1 | (int)x0

  __m128d x;
  __m128i ix;
  ix = _mm_cvtpd_epi32(x);

SIMD Int -> SIMD Double: conversion from two lower ints
  ix:  ? | ? | ix1 | ix0
  x:   (double)ix1 | (double)ix0

  x = _mm_cvtepi32_pd(ix);
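A self-contained sketch of both conversions, wrapped in two hypothetical helper functions (doubles_to_ints and ints_to_doubles) so the lane layout can be inspected through ordinary arrays. It assumes an x86 compiler that provides <emmintrin.h>; array index 0 corresponds to the lowest lane. Note that _mm_cvtpd_epi32 rounds according to the current MXCSR rounding mode, which is round-to-nearest by default.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* SIMD double -> SIMD int: lanes 0 and 1 receive the converted values,
   lanes 2 and 3 are cleared to zero. */
void doubles_to_ints(const double in[2], int out[4])
{
    __m128d x  = _mm_loadu_pd(in);
    __m128i ix = _mm_cvtpd_epi32(x);
    _mm_storeu_si128((__m128i *)out, ix);
}

/* SIMD int -> SIMD double: only the two lower ints are converted;
   the two upper ints are ignored. */
void ints_to_doubles(const int in[4], double out[2])
{
    __m128i ix = _mm_loadu_si128((const __m128i *)in);
    __m128d x  = _mm_cvtepi32_pd(ix);
    _mm_storeu_pd(out, x);
}
```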
SSE3: No new Data Types but new Instructions
  Use                          Instructions
  SIMD FP using AOS format*    HADDPD, HSUBPD, HADDPS, HSUBPS
  Thread synchronization       MONITOR, MWAIT
  Video encoding               LDDQU
  Complex arithmetic           ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
  FP to integer conversions    FISTTP

* Also benefits Complex and Vectorization
Streaming SIMD Extensions 3: 13 new instructions
Three have limited use for application performance improvement
  FISTTP - x87 to integer conversion (requires long double switch)
  MONITOR/MWAIT - thread synchronization
    Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages
The other ten have some potential for specific application domains
SSE-3 Sample Complex Arithmetic: ADDSUBPS
ADDSUBPS OperandA, OperandB
  OperandA (xmm register; 4 data elements):                    a3, a2, a1, a0
  OperandB (xmm register or memory address; 4 data elements):  b3, b2, b1, b0
  Result (stored in OperandA):                                 a3+b3, a2-b2, a1+b1, a0-b0

Intrinsic: __m128 _mm_addsub_ps(__m128 a, __m128 b)
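Why the alternating subtract/add pattern suits complex arithmetic can be seen from a scalar C model. This is only a sketch of the semantics, written in plain C rather than with the _mm_addsub_ps intrinsic; the names addsub_ps_model and complex_mul2_model are hypothetical, and array index 0 stands for the lowest element (a0/b0).

```c
/* Scalar model of ADDSUBPS: even elements are subtracted, odd elements added. */
void addsub_ps_model(const float a[4], const float b[4], float r[4])
{
    r[0] = a[0] - b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] - b[2];
    r[3] = a[3] + b[3];
}

/* Multiply two pairs of complex numbers stored as {re0, im0, re1, im1}.
   (xr + i*xi)(yr + i*yi) = (xr*yr - xi*yi) + i*(xr*yi + xi*yr),
   so the real part needs a subtract and the imaginary part an add. */
void complex_mul2_model(const float x[4], const float y[4], float out[4])
{
    float p[4], q[4];
    for (int k = 0; k < 4; k += 2) {
        p[k]     = x[k] * y[k];          /* xr * yr */
        p[k + 1] = x[k] * y[k + 1];      /* xr * yi */
        q[k]     = x[k + 1] * y[k + 1];  /* xi * yi */
        q[k + 1] = x[k + 1] * y[k];      /* xi * yr */
    }
    addsub_ps_model(p, q, out);          /* {re, im, re, im} */
}
```

With interleaved real and imaginary parts, the single add/subtract step yields both parts of each product without a separate sign-flip.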
Sample SSSE-3 Inst.: Byte Permute
PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
A complete byte-granularity permutation
  The source operand is used as the control field (variable control)
  The destination operand gets permuted
  Each byte of the source field selects the origin of the corresponding destination byte
  Also includes force-byte-to-zero flag (bit 7)

Example (most significant byte first):
  dest (before):  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
  src (control):  0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
  dest (after):   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
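The selection rule can be written out as a scalar C model of the 64-bit form. This is a sketch of the semantics only, not the intrinsic; the function name pshufb64_model is hypothetical, and array index 0 is the least significant byte, so the byte rows on the slide appear here in reverse order.

```c
#include <stdint.h>

/* Scalar model of PSHUFB mm, mm/m64.
   For each byte position i:
     - if bit 7 of src[i] is set, the result byte is forced to zero;
     - otherwise the low 3 bits of src[i] select which original
       dest byte is copied to position i. */
void pshufb64_model(uint8_t dest[8], const uint8_t src[8])
{
    uint8_t old[8];
    for (int i = 0; i < 8; ++i)
        old[i] = dest[i];            /* keep the original bytes */
    for (int i = 0; i < 8; ++i)
        dest[i] = (src[i] & 0x80) ? 0 : old[src[i] & 0x07];
}
```

The 128-bit form works the same way, except that the low 4 bits of each control byte select among 16 bytes.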
Ways to SSE/SIMD programming
Coding using SSE/SSE2/3/4 assembler instructions
  Very tedious (manual scheduling); discouraged: don't do it!
  E.g.: how do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel 64 without maintaining two versions?
Intel compilers' C/C++ SIMD intrinsics
  No need to take care of register allocation, scheduling, etc.
Intel compilers' C++ Vector Class Library
  Use this if you are heavily into C++ classes
Vectorizer of the Intel C++ and Fortran Compilers
  Recommended for most cases: easy and efficient
Use ready-to-go vectorized code from a library such as the Intel Math Kernel Library (MKL)
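As an illustration of the vectorizer route, the loop below is written in a form that vectorizing compilers can typically handle: a simple counted loop, unit-stride accesses, and restrict-qualified pointers that promise no overlap. The function name saxpy is a conventional example name, not something taken from the slides, and whether a given compiler actually vectorizes it depends on the compiler and the switches used.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   'restrict' tells the compiler that x and y do not overlap, which removes
   a common obstacle to automatic vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The source stays plain C, so the same code can be recompiled for processors with 8 or 16 SSE registers without maintaining two versions.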
Compiler Based Vectorization: Processor Specific

Linux* switch    Generate code and optimize for
-xP, -axP        Intel processors with SSE3 capability, including Pentium 4 (both 32- and 64-bit mode); code generation for MMX, SSE, SSE2 and SSE-3
-xN, -axN        Pentium 4 processors in 32-bit mode; code generation for MMX, SSE and SSE2. Deprecated switch: use -xW instead
-xK, -axK        Pentium 3 compatible and Athlon XP processors; code generation for MMX and SSE
-xW, -axW        Pentium 4 compatible, Athlon 64 and Opteron processors in 32- and 64-bit mode; code generation for MMX, SSE and SSE2
-xT, -axT        Intel processors with MNI capability: Intel Core 2 Duo processors (Conroe, Merom, Woodcrest); code generation for MMX, SSE, SSE2, SSE-3 and MNI
-xB, -axB        Pentium M processors; code generation for MMX, SSE and SSE-2
Intel Core Micro-architecture Notable Features (cont.): New Instructions

Instruction name, followed by its description:

PALIGNR mm, mm/m64, imm8
PALIGNR xmm, xmm/m128, imm8
  Extract any continuous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register.

PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
  A complete byte-granularity permutation, including force-to-zero flag.

PMULHRSW mm, mm/m64
PMULHRSW xmm, xmm/m128
  Signed 16 bits multiply, return high bits.

PMADDUBSW mm, mm/m64
PMADDUBSW xmm, xmm/m128
  Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)

phsubw/d/sw mm, mm/m64
phsubw/d/sw xmm, xmm/m128
  Pairwise integer horizontal subtract + pack.

phaddw/d/sw mm, mm/m64
phaddw/d/sw xmm, xmm/m128
  Pairwise integer horizontal addition + pack.

pabsb/w/d mm, mm/m64
pabsb/w/d xmm, xmm/m128
  Per element, overwrite destination with absolute value of source.

psignb/w/d mm, mm/m64
psignb/w/d xmm, xmm/m128
  Per element, if the source operand is negative, multiply the destination operand by -1.
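The PALIGNR description can be made concrete with a scalar C model of the 64-bit form. This is a sketch of the semantics rather than the intrinsic; the function name palignr64_model is hypothetical, and array index 0 is the least significant byte.

```c
#include <stdint.h>

/* Scalar model of PALIGNR mm, mm/m64, imm8.
   The pair [dst, src] forms a 16-byte value with dst in the upper half.
   The result is the 8 bytes starting imm8 bytes above the bottom of that
   pair; positions beyond the pair read as zero. */
void palignr64_model(uint8_t dst[8], const uint8_t src[8], unsigned imm8)
{
    uint8_t cat[16];
    for (int i = 0; i < 8; ++i) {
        cat[i]     = src[i];   /* low half of the pair  */
        cat[i + 8] = dst[i];   /* high half of the pair */
    }
    for (unsigned i = 0; i < 8; ++i) {
        unsigned j = i + imm8;
        dst[i] = (j < 16) ? cat[j] : 0;
    }
}
```

With imm8 = 0 the result is simply src, with imm8 = 8 it is the original dst, and values in between give a window that straddles the two operands, which is useful for reading unaligned data from two aligned loads.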
Dependencies and Bypasses
Read-after-Write dependency: 1 clock stall, assuming the register file can be written through
  Instruction       Dest   Pipeline stages
  add eax, ecx      eax    F D E W
  sub ebx, eax      ebx    F D D E W

E-to-D bypass: saves the clock penalty
  Instruction       Dest   Pipeline stages
  add eax, ecx      eax    F D E W
  sub ebx, eax      ebx    F D E W

Long-latency operations
  Instruction       Dest   Pipeline stages
  Load [ecx+edi]    eax    F D E E E W
  add ebx, eax      ebx    F D D D E W
Fighting Stalls: Branch Handling
Given the code