Date posted: 12-Dec-2014 | Category: Data & Analytics | Uploaded by: chris-adkin
SQL Server Engine: Batch Mode and CPU Architectures
DBA Level 400
About me
An independent SQL consultant and a user of SQL Server from version 2000 onwards, with 12+ years of experience. I have a passion for understanding how the database engine works at a deep level.
“Everything fits in memory, so performance is as good as it will get. It fits in memory therefore end of story”
A Simple, Typical Data Warehouse Query
1,095,500,000 rows; column store 1,798 MB in size; elapsed time 11,426 ms.
Same Query, But With A Subtly Different Column Store
1,095,500,000 rows; column store 8,555 MB in size; elapsed time 6,060 ms.
A larger column store, yet a faster query?! Could it be related to the way CPUs work?
[Chart: elapsed time (ms, 0 to 80,000) against degree of parallelism (2 to 24) for the non-sorted and sorted column stores.]
[Chart: percentage CPU utilisation (0 to 60%) against degree of parallelism (2 to 24) for the non-sorted and sorted column stores.]
Wait Statistics Analysis
These statistics are for the query run with a DOP of 24, a warm column store object pool, and the column store created on pre-sorted data. CXPACKET waits can be discounted 99.9% of the time. Signal wait time equal to total wait time is to be expected for short waits on uncontended spin locks. They shed little light on why only 60% CPU utilisation is achievable.
CPU Utilisation / DOP For The 'Sorted' Column Store (High Resolution)

DOP:               2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
CPU utilisation %: 5  8 10 13 15 18 20 23 25 28 30 33 35 38 40 43 46 48 50 53 54 58 59

Roughly 5% CPU utilisation per core for the first thread to use a core (as expected); utilisation plateaus at approximately 60%.
Modern CPU Architecture
[Diagram: each core contains an L0 uop cache, a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 256 KB unified L2 cache; the cores share the L3 cache via a bi-directional ring bus.]
A system-on-chip (SoC) design with CPU cores as the basic building block.
Utility services (memory controller, QPI links, power and clock, IO TLB) are provisioned by the 'un-core' part of the CPU die.
Four-level cache hierarchy.
How Modern CPUs Work
Cache, fetch and decode (front end); execute (back end).
Non-decoded instructions are complex x86/x64 instructions; decoded instructions are micro-operations (uops).
Batch mode focuses on this area. Appendices A and B cover this in greater detail.
CPU Pipeline Architecture
A 'pipeline' made of logical slots runs through the processor, from allocation at the front end to retirement at the back end.
The front end can allocate up to four micro-ops per clock cycle.
The back end can retire up to four micro-ops per clock cycle.
Empty slots, or 'bubbles', can indicate front end or back end pressure.
The KPIs (clock cycles per instruction (CPI), front end bound, and back end bound) are derived from these facts.
The Database Engine Is Layered Also . . .
Language processing: Sqllang.dll
Runtime: Sqlmin.dll (execution and storage engine), Sqltst.dll (expression service), QDS.dll (query data store, new to SQL 2014)
SQL OS: Sqlos.dll, Sqldk.dll
CPU Cache Access Latencies in Clock Cycles

L1 cache, sequential access:      4
L1 cache, in-page random access:  4
L1 cache, full random access:     4
L2 cache, sequential access:      11
L2 cache, in-page random access:  11
L2 cache, full random access:     11
L3 cache, sequential access:      14
L3 cache, in-page random access:  18
L3 cache, full random access:     38
Main memory:                      167

Batch mode is about working in the 4 to 38 clock cycle range and NOT the 167-cycle "CPU stall" range.
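To get a feel for what these latencies mean in wall-clock terms, the back-of-the-envelope sketch below (my own illustration, not from the deck) converts the latency table into time for one billion dependent accesses on the 2.0 GHz test CPU described later, assuming no pipelining, overlap, or prefetching:

```python
# Rough model: cost of n dependent memory accesses, each served from one
# level of the hierarchy, using the clock-cycle latencies from the slide.
LATENCY_CYCLES = {"L1": 4, "L2": 11, "L3_seq": 14, "L3_random": 38, "main_memory": 167}
CLOCK_HZ = 2.0e9  # the 2.0 GHz Sandy Bridge CPUs from the test setup

def seconds_for_accesses(n_accesses: int, level: str) -> float:
    """Idealised single-threaded cost of n dependent accesses at one level."""
    return n_accesses * LATENCY_CYCLES[level] / CLOCK_HZ

n = 1_000_000_000  # same order of magnitude as the 1,095,500,000-row scans
print(f"L2-resident:  {seconds_for_accesses(n, 'L2'):.1f} s")           # 5.5 s
print(f"memory-bound: {seconds_for_accesses(n, 'main_memory'):.1f} s")  # 83.5 s
```

The 15x gap between the two figures is the motivation for keeping the working set cache-resident.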
A Basic NUMA Architecture
[Diagram: two NUMA nodes, each with four cores; every core has its own L1 and L2 caches, and the cores on a node share an L3 cache. Each node accesses its local memory directly; access to the other node's memory is a remote memory access. An IO hub connects the nodes to I/O.]
Four and Eight Node NUMA QPI Topologies ( Nehalem i7 onwards )
[Diagram: four-socket (CPU 0-3) and eight-socket (CPU 0-7) QPI topologies, each with their IO hubs.]
With 18-core Xeons in the offing, these topologies will become increasingly rare.
NUMA Node Remote Memory Access Latency
An additional 20% overhead when accessing 'foreign' memory! (from coreinfo)
CPU Stalls and Batch Mode: Memory Access Patterns Are Key
Sequential scans make it easy for the pre-fetcher to determine the need to perform read-aheads.
Random memory access will make the pre-fetcher see little value in performing read-aheads.
The ability of the pre-fetcher to do its job determines the likelihood of the data you require being in one of the CPU caches.
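The cache-friendliness of sequential versus random access can be sketched with a toy simulation (my own illustration, not from the deck). It models only spatial locality in a small direct-mapped cache, not the hardware prefetcher itself, but the same contrast drives both effects:

```python
import random

def hit_rate(addresses, line_size=64, n_lines=512):
    """Simulate a tiny direct-mapped cache (512 x 64-byte lines = 32 KB)
    and return the hit rate for a sequence of byte addresses."""
    tags = [None] * n_lines
    hits = 0
    for addr in addresses:
        line = addr // line_size       # which cache line this address lives on
        slot = line % n_lines          # direct-mapped: one possible slot per line
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line          # miss: evict whatever was in the slot
    return hits / len(addresses)

n, stride = 200_000, 8                 # 8-byte values, as in a bigint column
sequential = [i * stride for i in range(n)]
random.seed(42)
shuffled = random.sample(sequential, n)
print(f"sequential hit rate: {hit_rate(sequential):.3f}")  # 0.875: 7 of every 8 values share a line
print(f"random hit rate:     {hit_rate(shuffled):.3f}")    # close to zero: almost every probe misses
```

The working set (1.6 MB) dwarfs the simulated cache, so random probes almost always miss, while sequential access still hits 7 times out of 8 purely from cache-line reuse.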
OLTP Workloads, CPU Stalls and Hyper-Threading ( Nehalem i7 onwards )
Session 1 performs an index seek against an n-row B-tree; the page is not found in the CPU cache (a last level cache miss), so a CPU stall takes place (160+ clock cycles) whilst the page is retrieved from memory.
The 'dead' CPU cycles incurred by the stall give the physical core the opportunity to run the second hyper-thread.
The Rationale Behind SQL OS
SQL OS layer: one scheduler per logical processor.
The OS has no concept of SQL resources such as latches.
SQL OS schedulers act as virtual processors, turning context switches into soft context switches.
SQL OS schedules threads by prioritising L2 cache hits and reuse over 'fairness'.
Putting Batch Mode and How CPUs Work Together
Test Setup
CPUs: 2 x 6-core 2.0 GHz (Sandy Bridge)
Memory: 48 GB quad-channel 1333 MHz DDR3
Hyper-threading enabled, unless specified otherwise.
A warm large object cache was used in all tests to remove storage as a factor.
Windows Performance Analysis Tool Stack
Weight is sampled CPU time across all physical cores. At the top of the stack (above), SQL Server has consumed 24.22% of the weight; further down the stack, 14.45% of the weight is in sqlmin.dll!CBpagHashTable::FAggregateBatch.
Where Is The Bottleneck In The Plan ?
[Plan diagram: control flow and data flow.] The stack trace indicates that the bottleneck is right here.
Hypothesis
Column store created on the heap: hash probes cause random memory access; the hash table is likely to be at the high-latency end of the cache hierarchy.
Column store created on the clustered index: hash probes cause sequential memory access; the hash table is likely to be at the low-latency end of the cache hierarchy.
Introducing Intel VTune Amplifier XE
Investigating what happens at the CPU cache, clock cycle and instruction level requires tools outside of the standard set that ships with Windows and SQL Server.
VTune Amplifier uses hardware event sampling to determine what is happening on the CPU.
Refer to Appendix D for an overview of what “General exploration” provides.
Hash ( Aggregate ) Probe Random Vs Sequential Access Efficiency
181,578,272,367 versus 466,000,699 clock cycles !!!
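To put the clock-cycle comparison above into wall-clock terms, this short worked calculation (my own, using the 2.0 GHz clock from the test setup and ignoring any overlap across cores) converts both figures to core time:

```python
# The two clock-cycle totals from the VTune random-vs-sequential comparison.
random_probe_cycles = 181_578_272_367
sequential_probe_cycles = 466_000_699
clock_hz = 2.0e9  # 2.0 GHz test CPUs

print(f"random access:     {random_probe_cycles / clock_hz:.1f} s of core time")
print(f"sequential access: {sequential_probe_cycles / clock_hz:.2f} s of core time")
print(f"ratio: {random_probe_cycles / sequential_probe_cycles:.0f}x")  # roughly 390x
```

Around 90 seconds of core time versus a quarter of a second: the hash probe access pattern, not the raw data volume, dominates the cost.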
Last Level Cache Misses / Degree of Parallelism

DOP   Non-sorted LLC misses   Sorted LLC misses
 2             13,200,924             3,000,210
 4             30,902,163             1,200,084
 6            161,411,298            16,203,164
 8          1,835,828,499            29,102,037
10          2,069,544,858            34,802,436
12          4,580,720,628            35,102,457
14          2,796,495,741            48,903,413
16          3,080,615,628            64,204,494
18          3,950,376,507            63,004,410
20          4,419,593,391            85,205,964
22          4,952,446,647            68,404,788
24          5,311,271,763            72,605,082
Segment Sizes Per Column For The Two Column Stores
Takeaways
The aggregate size of a column store does not tell the entire story as to how well a query will perform using it:
Compressing the column used for the hash aggregate probes such that it fits inside the CPU cache, and
turning random memory access on hash probes into sequential memory access,
leads to savings in CPU cycles which outweigh the cost of scanning a column store that is larger than one based on the same non-pre-sorted data.
Batch mode hash joins leverage skew in order to use the CPU cache better (explained in the next two slides).
Row Mode Hash Join
[Diagram: probe and build inputs flow through repartition exchanges into a hash table partitioned across NUMA nodes.]
Data skew reduces parallelism by increasing the level of repartition activity: skew works against us. Repartitioning is expensive due to the workspace buffer management overheads involved.

Batch Mode Hash Join
[Diagram: threads share a single hash table of batches B1 ... BN.]
No expensive repartitioning overhead: skew works for us. Hash keys are more likely to be in the L2/L3 cache, as the skew increases the probability of the same keys being hit.
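The build/probe pattern both join styles share can be sketched in a few lines. This is a minimal illustration of the general technique, not SQL Server's implementation; the deck's point is that batch mode lets every thread probe one shared hash table, so skewed probe keys keep revisiting the same cache-resident buckets instead of triggering repartition exchanges:

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Minimal build/probe hash join: build a hash table on the smaller
    input, then probe it once per row of the larger input."""
    table = defaultdict(list)
    for row in build_rows:                      # build phase
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                      # probe phase: one lookup per row
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})        # merge matched rows
    return out

# Hypothetical dimension/fact rows for illustration.
dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
fact = [{"id": 1, "v": 10}, {"id": 1, "v": 20}, {"id": 3, "v": 30}]
print(hash_join(dim, fact, "id", "id"))  # two matching rows for id=1; id=3 drops out
```

In a parallel version, row mode would repartition both inputs so each thread owns a disjoint key range, whereas batch mode hands every thread the same `table`.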
Does The Stream Aggregate Perform Any Better ?
Even with the ‘Sorted’ column store, performance is killed by a huge row mode sort prior to the stream aggregate.
The above query takes seven minutes and nine seconds to run.
NUMA Architectures and Thread Scheduling
There are 24 logical processors (NUMA nodes 0 and 1) and a DOP of 12. How does SQL OS choose to schedule the threads: on one NUMA node, or does it split them between the two?
Hyper-Threading And CPU Core Scheduling
With 24 logical processors and a DOP of 6, how does SQL OS schedule hyper-threads in relation to physical cores?
[Diagram: CPU sockets 0 and 1, each with cores 0-5.]
Avoiding CPU Core Starvation
Ensure a consistent flow of data from physical disk to server CPUs.
[Diagram: data flows from SAN ports through the SAN fabric and HBA to the storage processor and on to the CPU.]
The "Fast Track" methodology involves the creation of balanced architectures that prevent CPU core starvation. The same thing applies to the world of "in memory".
Making Efficient Use Of The CPU In The "In Memory" World
Back end bound: no uops are delivered due to a lack of resources at the back end of the pipeline (port saturation).
Front end bound: the front end delivers fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls).
CPU KPI / Degree of Parallelism
[Chart: CPI, front end bound, and back end bound (0 to 0.8) against degree of parallelism (2 to 24).]
Refer to Appendix C for the formulae from which these metrics are derived.
Which Parts Of The Database Engine Are Suffering Back End Pressure ?
Results obtained for a degree of parallelism of 24; CBagAggregateExpression::TryAggregateUsingQE_Pure features prominently.

Port Saturation: The Excessive Demand For An Execution Unit
[Diagram: execution ports of a Sandy Bridge core, including the 256-bit FMUL and blend units.]
Port Saturation Analysis

Port 0 hit ratio: 0.46
Port 1 hit ratio: 0.45
Port 2 hit ratio: 0.16
Port 3 hit ratio: 0.17
Port 4 hit ratio: 0.10
Port 5 hit ratio: 0.55

0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60%, we may start to see this on ports 0, 1 and 5.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
Scalar instruction: C = A + B operates on one value at a time, e.g. 1 + 2 = 3.
SIMD instruction: Vector C = Vector A + Vector B operates on packed values in one operation, e.g. [1, 2, 3, 4] + [2, 3, 4, 5] = [3, 5, 7, 9].
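The vector addition on this slide can be expressed as a one-liner. Python only models the semantics here, not the hardware speedup: a real SIMD unit performs all four additions in a single instruction, which is where the CPI reduction comes from:

```python
def vector_add(a, b):
    """Element-wise addition: what one SIMD instruction does to a whole
    register of packed values in a single operation."""
    return [x + y for x, y in zip(a, b)]

print(vector_add([1, 2, 3, 4], [2, 3, 4, 5]))  # [3, 5, 7, 9]
```

A scalar loop would issue one add instruction per element; a 256-bit AVX register holds four 64-bit values, so the same work retires in a quarter of the instructions.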
Does The SQL Server Database Engine Leverage SIMD Instructions ?
AVX 1.0 (Sandy Bridge) offers 2x peak floating point performance.
Haswell offers more flexibility, including integer vectorisation (AVX 2).
Skylake introduces wide 512-bit vectorisation (AVX 3.2).
Conclusions
Memory is incredibly nuanced: where the memory is and how it is accessed are significant factors in software performance.
Batch mode aims to eliminate the huge clock cycle penalties paid by accessing main memory: memory is the new disk.
Batch mode in SQL Server 2014 CU2 is not fully scalable; some bottleneck, which cannot be identified through SQL Server itself, leads to a 60% CPU utilisation ceiling.
The work carried out to date on the database engine has helped to minimise CPU stalls; this has pushed the bottleneck onto the CPU's back end. Microsoft now needs to address this, and leveraging SIMD technology could help in this area.
Questions ?
Contact Details
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Appendices
Appendix A: Instruction Execution And The CPU Front / Back Ends
[Diagram: on the front end, branch prediction, fetch, and decode, fed from the cache; on the back end, the decoded instruction buffer, execute, and reorder and retire.]
Appendix B - The CPU Front / Back Ends In Detail
[Diagram: the front end and back end in detail.]

Appendix C - CPU Pressure Points, Important Calculations
Front end bound (smaller is better) = IDQ_NOT_DELIVERED.CORE / (4 * clock ticks)
Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * clock ticks)
Retiring = UOPS_RETIRED.RETIRE_SLOTS / (4 * clock ticks)
Back end bound = 1 - (Front end bound + Bad speculation + Retiring); ideally, this should equal 1 - Retiring.

Appendix D - VTune Amplifier General Exploration
An illustration of what the "General exploration" analysis capability of the tool provides.
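The Appendix C formulae translate directly into code. The sketch below implements them as written; the counter values passed in are made-up illustrative numbers, not measurements from the deck:

```python
def pipeline_kpis(clock_ticks, idq_not_delivered, uops_issued,
                  uops_retired_slots, recovery_cycles):
    """Top-down pipeline KPIs from Appendix C. The slot total is
    4 * clock_ticks because the front end can allocate, and the back end
    retire, up to four uops per clock cycle."""
    slots = 4 * clock_ticks
    front_end_bound = idq_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + 4 * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    back_end_bound = 1 - (front_end_bound + bad_speculation + retiring)
    return front_end_bound, bad_speculation, retiring, back_end_bound

# Illustrative counter values only:
fe, bad, ret, be = pipeline_kpis(clock_ticks=1_000_000,
                                 idq_not_delivered=400_000,
                                 uops_issued=2_100_000,
                                 uops_retired_slots=2_000_000,
                                 recovery_cycles=25_000)
print(f"front end bound {fe:.2f}, bad speculation {bad:.2f}, "
      f"retiring {ret:.2f}, back end bound {be:.2f}")
```

The four fractions partition the pipeline slots: whatever is not front end bound, bad speculation, or usefully retiring must be back end bound, which is exactly the ceiling the deck identifies.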