High Scale OLTP Lessons Learned from SQLCAT Performance Labs


Designing Highly Scalable OLTP Systems

High Scale OLTP: Lessons Learned from SQLCAT Performance Labs
Ewan Fairweather: Program Manager, Microsoft

With the combination of Windows Server 2008 R2 and SQL Server 2008 R2 it has now become possible to run SQL Server on up to 256 cores. Recently, the SQL CAT team had a customer in the lab to test a banking application at high scale-up. Previously, we had test runs for the same application on 32 cores, but now the time had come to stress the app to 128 cores. In this session, I will talk about the lessons we learned from stress testing an OLTP workload at this scale. You will see some interesting bottlenecks and get an idea of what sort of numbers are achievable on a big SQL Server box. Of course, I will also provide you with the guidance on how to get those numbers.

Session Objectives and Takeaways
Session Objectives:
Learn about SQL Server capabilities and challenges experienced by some of our extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission critical workloads.

Key Takeaways:
SQL Server can meet the needs of many of the most challenging OLTP scenarios in the world.
There are a number of new challenges when designing for high end OLTP systems.

Laying the Foundation and Tuning for OLTP
Laying the foundation and tuning for OLTP workloads:
Understand the goals and attributes of the workload: performance requirements, machine-born data vs. user-driven solution, read-write ratio, HA/DR requirements which may have an impact.
Apply configuration and best-practices guidance: database and data file considerations, transaction log sizing and placement, configuring the SQL Server tempdb database, optimizing memory configuration.
Be familiar with common performance methodologies, toolsets and common OLTP/scaling performance pain points.
Know your environment - understanding the hardware is key.

Hardware Setup - Database Files
Database files: the number of files should be at least 25% of CPU cores. This alleviates PFS contention (PAGELATCH_UP). There is no significant point of diminishing returns up to 100% of CPU cores, but manageability is an issue - though Windows 2008 R2 makes it much easier.
TempDb: PFS contention is a larger problem here as it is an instance-wide resource (deallocations and allocations, RCSI version store, triggers, temp tables). The number of files should be exactly 100% of CPU threads. Pre-size at 2 x physical memory.
Data files and TempDb on the same LUNs: it is all random I/O anyway, so don't sub-optimize. IOPS is a global resource for the machine. The goal is to avoid PAGEIOLATCH on any data file.
Key takeaway: script it! At this scale, manual work WILL drive you insane. There are a lot of scripts available to do this; some of these can be had by mailing me your Live Account.
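The file-count guidance above is easy to script. A minimal sketch, assuming the default tempdb logical file names and a hypothetical T:\tempdb path (adjust the file count to your CPU thread count and the sizes per the 2 x physical memory guidance):

ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 16GB, FILEGROWTH = 0);   -- pre-size the existing file

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf',
          SIZE = 16GB, FILEGROWTH = 0);
-- ...repeat ADD FILE until the number of tempdb data files equals the number of CPU threads.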

Speaker note (TempDb): Temp swapping out of memory error is > 2x; index builds will be done in memory if there is memory pressure.

Special Consideration: Transaction Log
The transaction log is a set of 127 linked buffers with a maximum of 32 outstanding I/Os. Each buffer is 60KB, and multiple transactions can fit in one buffer. BUT: a buffer must flush before the log manager can signal a commit OK. Pre-allocate the log file; use DBCC LOGINFO for existing systems. Example: transaction log throughput was ~80MB/sec. (We also consistently got 2x per-core processing over previous IA64 CPUs.)

HealthCare Application - Technical Challenges (Challenge / Consideration or Workaround)
Network: 10 Gb/s network used; no bottlenecks observed.
Concurrency: Observed spikes in CPU at random times during the workload. Significant spinlock contention on SOS_CACHESTORE due to frequent re-generation of security tokens; hotfix provided by the SQL Server team. Result: SOS_CACHESTORE contention removed. Spinlock contention on LOCK_HASH due to heavy reading of the same rows; this was due to an incorrect parameter being passed in by the test workload. Result: LOCK_HASH contention removed, reduced CPU from 100% to 18%.

Transaction Log: Synchronous replication at the storage level. Observed 10-30ms log latency where 3-5ms was expected. Encountered Virtual Log File fragmentation (DBCC LOGINFO) - rebuilt the log. Observed overutilization of the front-end fibre channel ports on the array - reconfigured storage, balancing traffic across front-end ports. Result: 3-5ms latency.
Database and table design/Schema: The schema utilizes hash partitioning to avoid page latch contention on inserts. The requirement for low privileges requires longer code paths in the SQL engine.

Monitoring: Heavily utilized Extended Events to diagnose spinlock contention points.
Architecture/Hardware: Currently running 16-socket IA64 in production. Benchmark performed on 8-socket x64 Nehalem-EX (64 physical cores); hyper-threading to 128 logical cores offered little benefit to this workload. Encountered high NUMA latencies (coreinfo.exe), resolved via firmware updates.

NUMA latencies: Sysinternals CoreInfo - http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx. On Nehalem-EX every socket is a NUMA node - how fast is your interconnect?

Log Growth and Virtual Log File Fragmentation
The SQL Server physical transaction log is comprised of Virtual Log Files (VLFs). Each auto-growth/growth event will add additional VLFs. Frequent auto-growths can introduce a large number of VLFs, which can have a negative effect on log performance due to the overhead of the additional VLFs and file system fragmentation. Consider rebuilding the log if you find 100s or 1,000s of VLFs. DBCC LOGINFO can be used to report on this (example below):

FileId  FileSize     StartOffset     FSeqNo  Status  Parity  CreateLSN
------  -----------  --------------  ------  ------  ------  --------------------
2       253952       8192            48141   0       64      0
2       427556864    74398826496     0       0       128     22970000047327200649
2       427950080    74826383360     0       0       128     22970000047327200649
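A small sketch for counting VLFs in an existing log, based on the DBCC LOGINFO output above (the column list matches the SQL Server 2008 R2 shape; later versions add a RecoveryUnitId column and expose sys.dm_db_log_info instead):

CREATE TABLE #loginfo
(
    FileId INT, FileSize BIGINT, StartOffset BIGINT,
    FSeqNo INT, Status INT, Parity INT, CreateLSN NUMERIC(38, 0)
);

INSERT INTO #loginfo
EXEC ('DBCC LOGINFO');          -- run in the database you want to check

SELECT COUNT(*) AS vlf_count FROM #loginfo;   -- 100s or 1,000s => consider rebuilding the log

DROP TABLE #loginfo;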

Spinlocks
Lightweight synchronization primitives used to protect access to data structures. Used to protect structures in SQL such as lock hash tables (LOCK_HASH), security caches (SOS_CACHESTORE) and more. Used when it is expected that resources will be held for a very short duration.
Why not yield? It would be more expensive to yield and context switch than to spin to acquire the resource.
Example (LOCK_HASH): a thread attempts to obtain a lock (row, page, database, etc.); a hash of the lock is maintained in the lock manager's hash table; threads accessing the same hash bucket of the table are synchronized by the LOCK_HASH spinlock.

Spinlock Diagnosis

select * from sys.dm_os_spinlock_stats
order by spins desc
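A slightly more focused variant of the query above (a sketch only; it simply filters and orders the same DMV by the columns shown in the sample output that follows):

SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE spins > 0
ORDER BY spins DESC, backoffs DESC;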

These symptoms may indicate spinlock contention:
1. A high number of spins is reported for a particular spinlock type, AND
2. The system is experiencing heavy CPU utilization, AND
3. The system has a high amount of concurrency.

Spinlock Diagnosis Walk-Through

Name               Collisions    Spins               Spins_Per_Collision  Backoffs
SOS_CACHESTORE     14,752,117    942,869,471,526     63,914               67,900,620
SOS_SUSPEND_QUEUE  69,267,367    473,760,338,765     6,840                2,167,281
LOCK_HASH          5,765,761     260,885,816,584     45,247               3,739,208
MUTEX              2,802,773     9,767,503,682       3,485                350,997
SOS_SCHEDULER      1,207,007     3,692,845,572       3,060                109,746

Extended Events capture the backoff events over a 1-minute interval and provide the code paths of the contention (security check related). Not a resolution, but we know where to start. Much higher CPU with a drop in throughput (at this point many SQL threads are spinning). Confirmed the theory via dm_os_spinlock_stats - observe this type with the highest spins and backoffs. High backoffs = contention.

Spinlock Walkthrough - Extended Events Script

--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff
    (action (package0.callstack)
     where type = 144)    --SOS_CACHESTORE
add target package0.asynchronous_bucketizer
    (set filtering_event_name = 'sqlos.spinlock_backoff',
         source_type = 1,
         source = 'package0.callstack')
with (MAX_MEMORY = 50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Ensure the session was created
select * from sys.dm_xe_sessions
where name = 'spin_lock_backoff'

--Run this section to measure the contention
alter event session spin_lock_backoff on server state = start

--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data:
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)

--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast(target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'

--Clean up the session
alter event session spin_lock_backoff on server state = stop
drop event session spin_lock_backoff on server

A complete walkthrough of the technique can be found here:

http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Regeneration of Security Tokens Results in High SOS_CACHESTORE Spins
Observation: it is counterintuitive to have high wait times (LCK_M_X) correlate with heavy CPU - this is the symptom, not the cause. At random times CPU spikes, then almost all sessions are waiting on LCK_M_X.


Huge increase in the number of spins and backoffs associated with SOS_CACHESTORE.
Approach: use Extended Events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs).
Root cause: regeneration of security tokens exposes contention in code paths for access permission checks.
Workaround/problem isolation: run with sysadmin rights.
Long-term change required: SQL Server fix.

Fully Qualified Calls To Stored Procedures
Developer uses "Exec myproc" for dbo.myproc. SQL acquires an exclusive lock (LCK_M_X) and prepares to compile the procedure; this includes calculating the object ID. dm_exec_requests revealed almost all the sessions were waiting on LCK_M_X to compile a stored procedure.
Workaround: make the app user db_owner.
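To make the difference concrete, a minimal sketch using the myproc example above (schema-qualifying the call avoids the owner-resolution step that takes the compile lock):

-- Not fully qualified: SQL Server must first check whether the calling user owns
-- an object named myproc, taking the LCK_M_X compile lock described above.
EXEC myproc;

-- Fully qualified: the dbo.myproc plan can be looked up in the plan cache directly.
EXEC dbo.myproc;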

(Call stack: SOS_CACHESTORE - GetObjectbyObjectID - GetOwnerbySID)

Case Study: Point of Sale (POS) System
Application: Point of Sale application supporting sales at 8,000 stores.
Performance: sustain expected peak load of ~230 business transactions (checks) per second.

Workload Characteristics:
230 business transactions = ~50,000 batches/sec.
Heavy insert into a few tables, periodic range scans of newly added data.
Heavy network utilization due to inserts and use of BLOB data.

Hardware/Deployment Configuration:
Custom test harness, 12 load generators, 5 application servers.
Database servers: HP DL785, 48 physical cores, 256GB RAM.

Case Study: Point of Sale (POS) System (cont.)
Other Solution Requirements:
Mission critical to the business in terms of performance and availability; strict uptime requirements.
SQL Server Failover Clustering for local (within datacenter) availability.
Storage-based replication (EMC SRDF) for disaster recovery.
Quick recovery time for failover is a priority.
Observations:
Initial tests showed low overall system utilization.
Long duration for insert statements.
High waits on buffer pages (PAGELATCH_EX/PAGELATCH_SH).
Network bottlenecks once the latch waits were resolved.
Recovery times (failure to DB online) after failover under full load were between 45 seconds and 3 minutes for unplanned node failures.

POS Benchmark Configuration

SAN: CX-960 (240 drives, 15K, 300GB)

5 x App servers: 5 x BL460, 2 proc (quad core), 32-bit, 32 GB memory

12 x Load drivers: 2 proc (quad core), x64, 32+ GB memory
Switch
Transaction DB Server: 1 x DL785, 8P (quad core), 2.3GHz, 256 GB RAM

Network switch

Reporting DB Server: 1 x DL585, 4P (dual core), 2.6 GHz, 32 GB RAM
SAN switch: Brocade 4900 (32 ports active)
DL785, DL585, BL460 blade servers, Dell R900s, R805s; Active/Active failover cluster

Technical Challenges and Architecting for Extreme OLTP (Challenge / Consideration or Workaround)
Network:

CPU bottlenecks for network processing were observed and resolved via network tuning (RSS). Further network optimization was performed by implementing compression in the application. After optimizations we were able to push ~180K packets/sec, approximately 111 MB/sec, through a single 1 Gb/s NIC.

Concurrency: Page buffer latch waits were by far the biggest pain point. Hash partitioning was used to scale out the B-trees and eliminate the contention. Some PFS contention for the tables containing LOB data was resolved by placing LOB tables on dedicated filegroups and adding more files.

Transaction Log: No log bottlenecks were observed. When the cache on the array behaves well, log response times are very low.
Database and table design/Schema: Observed overhead related to PK/FK relationships; insert statements required additional work. Adding the persisted computed column needed for hash partitioning is an offline operation. Moving LOB data is an offline operation.

Monitoring: For the latch contention, utilized dm_os_wait_stats, dm_os_waiting_tasks and dm_db_index_operational_stats to identify the indexes with the most contention.
Architecture/Hardware: Be careful about shared components in blade server deployments - this became a bottleneck for our middle tier.

Hot Latches!
We observed very high waits for PAGELATCH_EX. High = more than 1ms; we observed greater than 20 ms. Be careful drawing conclusions just on averages.
What are we contending on? A latch is a lightweight semaphore. Locks are logical (transactional consistency); latches are physical (memory consistency). Because rows are small (many fit on a page), multiple threads accessing a single page may compete for one PAGELATCH even if there is no lock blocking.

[Diagram: an 8K page holding several rows; two concurrent sessions - INSERT VALUES(298, xxxx) and INSERT VALUES(299, xxxx) - each take an IX lock on the page but compete for the same EX page latch.]

[One of the sessions ends up in an EX_LATCH wait.]

Waits & Latches
Dig into details with: sys.dm_os_wait_stats and sys.dm_os_latch_stats

wait_type           % Wait Time
PAGELATCH_EX        86.4%
PAGELATCH_SH        8.2%
LATCH_SH            1.5%
LATCH_EX            1.0%
LOGMGR_QUEUE        0.9%
CHECKPOINT_QUEUE    0.8%
ASYNC_NETWORK_IO    0.8%
WRITELOG            0.4%

latch_class                         wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT    156,818
LOG_MANAGER                         103,316

Waits & Latches - Server Level (sys.dm_os_wait_stats)

select *,
    wait_time_ms / waiting_tasks_count [avg_wait_time],
    signal_wait_time_ms / waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
    and wait_type like '%PAGELATCH%'
order by wait_time_ms desc

Waits & Latches - Index Level (sys.dm_db_index_operational_stats)

/* latch waits */
select top 20 database_id, object_id, index_id, count(partition_number) [num partitions],
    sum(leaf_insert_count) [leaf_insert_count],
    sum(leaf_delete_count) [leaf_delete_count],
    sum(leaf_update_count) [leaf_update_count],
    sum(singleton_lookup_count) [singleton_lookup_count],
    sum(range_scan_count) [range_scan_count],
    sum(page_latch_wait_in_ms) [page_latch_wait_in_ms],
    sum(page_latch_wait_count) [page_latch_wait_count],
    sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait],
    sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms],
    sum(tree_page_latch_wait_count) [tree_page_latch_wait_count],
    case when (sum(tree_page_latch_wait_count) = 0) then 0
         else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count)
    end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id
order by sum(page_latch_wait_in_ms) desc

Hot Latches - Last Page Insert Contention
Most common for indexes which have monotonically increasing key values (i.e. datetime, identity, etc.).
Our scenario: two tables were insert heavy, by far receiving the highest number of inserts. Mainly INSERT, however there is a background process reading off ranges of the newly added data.
And don't forget: we have to obtain latches on the non-leaf B-tree pages as well. Compare page latch waits vs. tree page latch waits (sys.dm_db_index_operational_stats).
[Diagram: tree pages above leaf pages; with a monotonically increasing index, inserts land on the right-most data page in logical key order.] We call this Last Page Insert Contention.

Many threads inserting into end of range

Expect: PAGELATCH_EX/SH waits - and this is the observation.

How to Solve an INSERT Hotspot
Option #1: Hash partition the table. Based on a hash of a column (commonly a modulo), this creates multiple B-trees (each partition is a B-tree); round-robin between the B-trees creates more resources and less contention.
Option #2: Do not use a sequential key; distribute the inserts all over the B-tree.
[Diagram: four partitions covering ranges 0-1000, 1001-2000, 2001-3000, 3001-4000, each receiving its own INSERT stream.]
Hash Partitioning Reference: http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-workloads-part-1.aspx
Before: threads inserting into the end of the range cause contention on the last page. After: threads still insert into the end of the range, but across each partition of the hash partitioned table/index.

Example: Before Hash Partitioning

Latch waits of approximately 36 ms at baseline of 99 checks/sec.

Example: After Hash Partitioning
**Other optimizations were applied; hash partitioning was responsible for a 2.5x improvement in insert throughput.

Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.


Table Partitioning Example

--Create the partition function and scheme
CREATE PARTITION FUNCTION [pf_hash16] (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

CREATE PARTITION SCHEME [ps_hash16]
AS PARTITION [pf_hash16] ALL TO ([ALL_DATA])

--Add the computed column to the existing table (this is an OFFLINE operation if done
--the simple way); consider using bulk loading techniques to speed it up.
ALTER TABLE [dbo].[Transaction]
ADD [HashValue] AS (CONVERT([tinyint], abs(binary_checksum([uidMessageID]) % (16)), (0))) PERSISTED NOT NULL

--Create the index on the new partitioning scheme
CREATE UNIQUE CLUSTERED INDEX [IX_Transaction_ID]
ON [dbo].[Transaction] ([Transaction_ID], [HashValue])
ON ps_hash16(HashValue)

Note: Requires application changes. Ensure SELECT/UPDATE/DELETE have appropriate partition elimination.
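As an illustration of the required application change, a hedged sketch of a query with partition elimination against the example table (the uidMessageID type and the literal values are assumptions):

DECLARE @uidMessageID UNIQUEIDENTIFIER = NEWID();  -- value the application already holds
DECLARE @Transaction_ID INT = 12345;               -- illustrative key value

-- Compute the same expression as the persisted computed column...
DECLARE @HashValue TINYINT =
    CONVERT(TINYINT, ABS(BINARY_CHECKSUM(@uidMessageID) % 16));

-- ...and include it in the predicate so the optimizer touches only one partition.
SELECT *
FROM [dbo].[Transaction]
WHERE [Transaction_ID] = @Transaction_ID
  AND [HashValue] = @HashValue;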

1) Create a range partitioning function and scheme with 32 ranges (aligning with the CPU cores).
2) Create a persisted computed column which uses an existing key field in the index and computes a value from 1 to 32 (used to determine the partition the record will land on).
3) Rebuild the index on the new partitioning scheme using this column, adding the computed column to the index key definition (if necessary).

Application changes: this means that the application needs to know (or be able to generate) the HashValue. Addition of the computed column needed to support this solution is an OFFLINE operation, which is not acceptable for many existing production deployments.

Scenarios where we have seen PAGELATCH_EX contention:
Indexes that use an increasing datetime column as the leading key.
Indexes that use an identity column as the leading key column.
Indexes which contain a key with low cardinality.

Note that partition functions on TINYINT will output binary values (i.e. 0x01 ... 0x0E) instead of integers.
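To verify that the hash spreads inserts evenly across partitions, a small sketch against the example objects above (pf_hash16 and dbo.Transaction):

SELECT $PARTITION.pf_hash16(HashValue) AS partition_number,
       COUNT(*)                        AS row_count
FROM [dbo].[Transaction]
GROUP BY $PARTITION.pf_hash16(HashValue)
ORDER BY partition_number;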

Query Engine - Good Plan

Query Engine - Bad Plan
Table partitioning requires an update to the clustered index so it is partition aware by including a hash value. This is transparent to inserts, but it makes the current dequeue stored procedures and SQL Agent jobs very expensive. Queries need to be updated: the WHERE clauses must include the hash value. The dequeue INNER LOOP hint forces the query engine to scan all partitions (possible bug, VSTS #380030).

Network Cards - Rule of Thumb
At scale, network traffic will generate a LOT of interrupts for the CPU. These must be handled by CPU cores - packets must be distributed to cores for processing.

Tuning a Single NIC Card - POS System
Enable RSS so that multiple CPUs can process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney offload. Turning off the Base Filtering Service gave a huge reduction in CPU but may not be suitable for all production environments. Be careful with Chimney offload, as per KB 942861.


Before and After Tuning Single NIC

Before any network changes the workload was CPU bound on CPU0. After tuning RSS, disabling the Base Filtering Service and explicitly enabling TCP Chimney offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.

Single 1 Gb/s NIC.

To DTC or not to DTC: POS System
COM+ transactional applications are still prevalent today. This results in all database calls enlisting in a DTC transaction - a ~45% performance overhead. The scenario in the lab involved two resource managers, MSMQ and SQL:

Tuning approaches:
Optimize the DTC TM configuration (transparent to the app).
Remove DTC transactions (requires app changes).
Utilize System.Transactions, which will only promote to DTC if more than one RM is involved. See Lightweight Transactions: http://msdn.microsoft.com/en-us/magazine/cc163847.aspx#S5

wait_type                 total_wait_time_ms    total_waiting_tasks_count    average_wait_ms
DTC_STATE                 5,477,997,934         4,523,019                    1,211
PREEMPTIVE_TRANSIMPORT    2,852,073,282         3,672,147                    776
PREEMPTIVE_DTC_ENLIST     2,718,413,458         3,670,307                    740

We have measured the overhead of DTC in previous benchmarks and found a 45% performance hit for starting a distributed transaction.
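The figures above can be pulled with a simple DMV query; a sketch limited to the DTC-related wait types listed:

SELECT wait_type,
       wait_time_ms,
       waiting_tasks_count,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS average_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('DTC_STATE', 'PREEMPTIVE_TRANSIMPORT', 'PREEMPTIVE_DTC_ENLIST')
ORDER BY wait_time_ms DESC;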

The API being used was OLEDB and the application is native code written in C++. The application servers are running Windows 2003 SP2.

RM = Resource Manager (SQL, MSMQ, etc.). TM = Transaction Manager (the DTC service, which can be local or remote).

PREEMPTIVE_TRANSIMPORT & PREEMPTIVE_DTC_ENLIST are expected, and since these are PREEMPTIVE_ the wait time reflects the time inside the call and is recorded for each call. So the average time to import a transaction is 776ms and to enlist in to the transaction is 740ms, which given the fact both invoke RPC messages seems about right. The code path that tracks both of these must also call to the code that tracks the DTC_STATE waits. So any waits within them will have an implicit wait on DTC_STATE. By default, the application servers use the local MSDTC coordinator to manage the transactions. This requires RPC communication between the SQL Server and the remote coordinator which can introduce significant overhead under high transactional load. In addition the application servers are running within VMWare VMs and it is likely that running within a VM introduces some additional overhead in the network stack which can impact this even more. To address this, the following configuration change was made. Instead of relying on the local MSDTC coordinator the application servers can be configured to utilize a remote coordinator which resides on the database server. This removes the need for RPC communication between the database server and application server to manage the DTC transactions. In addition, this results in using MSDTC which is part of Windows 2008 which has significant improvements over Win2003. After making this change the DTC related waits became a non-issue and did not show up in the top 10 list of wait types.

http://msdn.microsoft.com/en-us/library/ms973865.aspx#introsystemtransact_topic5

Optimizing DTC Configuration
By default, application servers use a local TM (MSDTC coordinator). This introduces RPC communication between the SQL TM and the app server TM, and the app virtualization layer incurs some delay. Configuring the application servers to use a remote coordinator removes the RPC communication. See Mike Ruthruff's paper on SQLCAT.COM:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Transaction_Ownership waits include coordination of transactions from the client - MSDTC, ADO.NET System.Transactions.

Recap: Session Objectives and Takeaways
Session Objectives:
Learn about SQL Server capabilities and challenges experienced by some of our extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission critical workloads.

Key Takeaways:
SQL Server can meet the needs of many of the most challenging OLTP scenarios in the world.
There are a number of new challenges when designing for high end OLTP systems.

Applied Architecture Patterns on the Microsoft Platform

Q & A

(c) 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Agenda
Windows Server 2008 R2 and SQL Server 2008 R2 improvements
Scale architecture
Customer requirements
Hardware setup
Transaction log essentials
Getting the code right
Application server essentials
Database design
Tuning data modification: UPDATE statements, INSERT statements
Management of LOB data
The problem with NUMA and what to do about it
Final results and thoughts

DB size = 1TB for test. 1,500 connections, tx = 18,000 SQL tx/sec.

After: 6,000 concurrent connections, 100,000 SQL tx/sec.

Top statistics (Category / Metric)
Largest single database: 80 TB
Largest table: 20 TB
Biggest total data, 1 customer: 2.5 PB
Highest writes per second, 1 db: 60,000
Fastest I/O subsystem in production (and in lab): 18 GB/sec (26 GB/sec)
Fastest real-time cube: 1 sec latency
Data load for 1 TB: 20 minutes
Largest cube: 12 TB

Largest DB: Telecom: 75TB in one box (64 core Superdome with dedicated 1068 spindle, XP24K SAN)PanStarrs: 80TB x64 with 64GB RAMFastest Data Load: 1TB in 20 minutes (Unisys ES7600R 64 core with ETL World Record run and SSD on TPCH LINEITEMS)(about 7 minutes on 256 core machine using Windows 2008 R2 and SQL 2008 R2)Highest SQL TX/sec on single instance (Core Banking): 36K tx/secHighest, Cube Processing speed (during Process Data phase) telecom with 1M rows/sec (64 Core Superdome for SSAS and another 64 core dome for SQL)Fastest I/O system in production (single box): Core banking at 18GB/sec with table scan workloads on a DAS (I-O system can do 25GB/sec when hitting it high (1MB) blocksize I/O)Highest Cube Processing Speed (Retail) 5M rows/secFastest Real Time Cube: Financial Currency Trading at 15 sec latency.Biggest total data: Genology site using RBS with meta data in SQL. Also, MySpace > 1TB total.

Upping the Limits
Previously (before 2008 R2), Windows was limited to 64 cores; the kernel was tuned for this configuration. With Windows Server 2008 R2 this limit is now upped to 256 cores (with plumbing for 1024 cores).
New concept: kernel groups - a bit like NUMA, but an extra layer in the hierarchy. SQL Server generally follows suit, but for now 256 cores is the limit on R2.
Example x64 machines: HP DL980 (64 cores, 128 with hyper-threading), IBM 3950 (up to 256 cores). The largest IA-64 is 256 hyper-threads (at 128 cores).

The Path to the Sockets
Windows OS -> Kernel Group 0 (NUMA 0-7), Kernel Group 1 (NUMA 8-15), Kernel Group 2 (NUMA 16-23), Kernel Group 3 (NUMA 24-31).
Hardware -> each NUMA node contains CPU sockets, each socket contains CPU cores, and each core runs two hyper-threads (e.g. in the diagram NUMA 6 and NUMA 7 each hold two sockets with two cores per socket and two HT threads per core).

And we measure it like this: Sysinternals CoreInfo - http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx. On Nehalem-EX every socket is a NUMA node - how fast is your interconnect?

#And it Looks Like This...

Customer Scenarios (Core Banking / Healthcare System / POS)
Workload: Credit card transactions from ATMs and branches / Sharing patient information across multiple healthcare trusts / World-record deployment of an ISV POS application across 8,000 US stores.
Scale Requirements: 10,000 business transactions/sec / 37,500 concurrent users / Handle peak holiday load of 228 checks/sec.
Technology: App tier .NET 3.5/WCF, SQL 2008 R2, Windows 2008 R2 / App tier .NET (virtualized), SQL 2008 R2, Windows 2008 R2 / App tier COM+, Windows 2003; SQL 2008, Windows 2008 Server.
Hardware: HP Superdome; HP DL785 G6; IBM 3950 and HP DL980; DL785.

Network Cards - Rule of Thumb
At scale, network traffic will generate a LOT of interrupts for the CPU. These must be handled by CPU cores - packets must be distributed to cores for processing. Rule of thumb (OLTP): 1 NIC per 16 cores. Watch the DPC activity in Task Manager. In Windows 2003, remove SQL Server (with the affinity mask) from the NIC cores.

Lab: Network Tuning Approaches
Tune the configuration options of a single NIC card to provide the maximum throughput.
Improve the application code to compress LOB data before sending it to the SQL Server.
Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
Add multiple NICs (better for scale).

Tuning a Single NIC Card - POS System
Enable RSS so that multiple CPUs can process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney offload. Be careful with Chimney offload, as per KB 942861.

Disabling the Base Filtering Service reduces security, but also reduces the overhead on the system; it is therefore important that SQL Server is isolated.

Windows Server 2003 SP1 and earlier versions do not allow multiple processors to concurrently process receive indications from a single network adapter. For a highly concurrent SQL OLTP workload, the lack of parallelism in NDIS v5.x packet receive processing results in an overall lack of scaling. Some contemporary CPUs and chipsets route all interrupts from a single network adapter to one specific processor, which results in a similar lack of parallelism. Therefore, scaling issues only increase because one CPU handles all device interrupts.

However, note that some NICs will raise the "Communication Link Failure" error when Chimney is turned on. Test this carefully: http://support.microsoft.com/kb/942861

Before and After Tuning Single NIC

Before any network changes the workload was CPU bound on CPU0. After tuning RSS, disabling the Base Filtering Service and explicitly enabling TCP Chimney offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.


Application blade servers shared a 1 Gb/s network connection.

The absolute maximum throughput we achieved on a single NIC card was 111 million bytes per second (1 Gb/s = 128 MB/sec theoretical; we achieved 111 MB/sec).

Mention that moving the base CPU off CPU 0 is beneficial, as CPU 0 is used for other Windows activities. This includes the network tuning we performed and the changes to the application which compressed LOB data prior to sending it to SQL Server.

Teaming NICs
The workload was bound by network throughput. Teaming of 2 network adapters realized no more aggregate throughput (application blade servers shared a 1 Gb/s network connection). Left until the next episode - consider 10 Gb/s NICs for this throughput.

The workload was still bound by network throughput. Teaming of 2 network adapters was performed on the server, however no additional aggregate throughput was realized. Application servers were running on blade servers which shared a 1 Gb/s network connection. Time constraints, and the fact we had already exceeded the goal of the lab, meant this was left until next time.

Some general network tuning options (courtesy of our SQL performance team) are given below. Not all of these may be suitable for production deployments and there is some dependency on the specific network adapters being used, but they are options for tuning:
Point-to-point connections between two machines.
Power settings in the OS set to high performance.
For TCP/IP: disabled the Base Filtering Engine service, which in turn disables the firewall, IPSec Policy Agent and IKE.
HT turned off.
Power policy set to high performance (by changing the settings in the control panel as well as the BIOS).
Enabled RSS; changed the number of RSS queues to 8 (equal to the number of processors). Hyper-threaded CPUs are not used since RSS won't use them.
Disabled header data split.
Interrupt moderation: disabled in the default case and set to adaptive in the optimized case.
Disabled RSS on any other NIC in the server (RSS load distribution is not uniform otherwise).
Unbound IPv6 from the NIC.

SQL Server Configuration Changes
As we increased the number of connections to around 6,000 (users had think time) we started seeing waits on THREADPOOL. Solution: increase sp_configure 'max worker threads'. You probably don't want to go higher than 4096; gradually increase it (the default max is 980). Avoid killing yourself with thread management - the bottleneck is likely somewhere else. Use the affinity mask to keep SQL Server off the cores running NIC traffic. For a well-tuned, pure-play OLTP workload there is no need to consider parallel plans: sp_configure 'max degree of parallelism', 1.
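A hedged sketch of the corresponding sp_configure changes (the 2048 value is what this project ended up with; treat both settings as workload-specific, not defaults to copy):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max worker threads', 2048;        -- raise gradually; avoid going past ~4096
EXEC sp_configure 'max degree of parallelism', 1;    -- pure-play OLTP: no parallel plans needed
RECONFIGURE;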

Ended up with 2048 on this project.

Getting the Code Right (Designing Highly Scalable OLTP Systems)

Lessons from ISV Applications
Parameterize, or pay the CPU cost and potentially hit the gateway limits for compilations (RESOURCE_SEMAPHORE_QUERY_COMPILATIONS).
Watch out for cursors: they tie up worker threads, and if they consume workspace memory you could see blocking (RESOURCE_SEMAPHORE).
Consume those results as quickly as possible (watch for ASYNC_NETWORK_IO).
Schema design: for an insert-heavy workload, RI can be very expensive. If performance is key, work out the RI outside the DB and trust your app.

RESOURCE_SEMAPHORE_QUERY_COMPILATIONS: this governs the number of concurrent compilations.

Query Parameterization Performance Impact
In general the most significant issue related to performance was due to excessive compilations, which are caused by poor query plan reuse. This is a result of queries submitted to SQL Server not being parameterized. The Forced Parameterization option (new in SQL 2005) helps with this issue, however not all queries submitted to the server meet the requirements needed to use this feature. This impacts performance in the following ways:
1. Compilation is expensive with respect to CPU consumption. CPU consumption by SQL Server could be reduced significantly by parameterizing the queries.
2. The SQL Server plan cache (sys.dm_exec_cached_plans) will consume more memory than should be required by the application due to many occurrences of single-use plans. This is a significant problem on 32-bit since memory for the plan cache is restricted to the 2/3 GB of user-mode virtual address space (not supported by AWE memory).
3. Under load there is contention related to the high number of concurrent compilation requests. SQL Server is designed to allow a certain number of concurrent compilations; when this is exceeded, queries requiring compilation will wait. If monitored through sys.dm_exec_requests or sys.dm_os_wait_stats this wait will show up as RESOURCE_SEMAPHORE_QUERY_COMPILE.

SQL Server behavior for concurrent compilations can be explained as follows. Queries are organized by optimizer memory consumption as small, medium, large and extra-large. The definition of small queries is hardcoded (taking less than 250K memory for optimization on x86, 500K on IA64 and 380K on x64). The definition of medium, large and extra-large queries is not so straightforward and is a fraction of the optimizer memory used by queries trying to acquire the small, medium or large semaphore respectively. There is no limit on the number of concurrent compilations for small queries. Medium queries need to acquire the small query gateway to proceed with compilation; there can be a maximum of 4 simultaneous compiles per CPU for medium queries. Large queries need to acquire the medium query gateway; there can be a maximum of 1 compile per CPU for large queries. Similarly, extra-large queries need to acquire the large query gateway; there can be a maximum of 1 compile per instance for extra-large queries. The limit for small/medium queries is a fraction of the optimizer target (Target Memory). The limit for large queries is a fraction of the overall memory (Overall Memory), defined by the Early Termination Factor from the memory status. That info can be obtained from DBCC MEMORYSTATUS, as shown below:

Optimization Queue          Value
--------------------------  --------
Overall Memory              13443072
Target Memory               11960320
Last Notification           1
Timeout                     24
Early Termination Factor    5

All three issues described above were observed as problems during our load tests.

Things to Double Check
Connection pooling enabled? How much connection memory are we using? Monitor perfmon: MSSQL: Memory Manager.
Obvious memory or handle leaks? Check the Process counters in perfmon for the .NET app. Server-side processes will keep memory unless under pressure.
Can the application handle the load? Call into dummy procedures that do nothing and check the measured application throughput. Typical case: the application breaks before SQL does.

Remote Calling from WCF
Original client code: synchronous calls in WCF. Each thread must wait for network latency before proceeding - around 1ms of waiting. Very similar to disk I/O: the thread will fall asleep. Lots of sleeping threads, limited to around 50 client simulations per machine. Instead, use IAsyncInterface.

Synchronous and Asynchronous Operations: http://msdn.microsoft.com/en-us/library/ms734701.aspx
How to: Implement an Asynchronous Service Operation: http://msdn.microsoft.com/en-us/library/ms731177.aspx
How to: Call WCF Service Operations Asynchronously: http://msdn.microsoft.com/en-us/library/ms730059.aspx

Tuning Data Modification (Designing Highly Scalable OLTP Systems)

Database Schema - Credit Cards
[Schema diagram: Transaction (Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount) with ~10**10 rows; Account (Account_ID, LastUpdateDate, Balance) with ~10**5 rows; ATM (ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID) with ~10**3 rows.
Statements: INSERT .. VALUES (@amount); INSERT .. VALUES (-1 * @amount); UPDATE .. SET LastTransaction_ID = @ID + 1, LastTransactionDate = GETDATE(); UPDATE .. SET Balance.]

In this slide I am trying to illustrate what it is that I see when I look at a database schema that needs to run fast. There are certain concerns that I will immediately raise; if addressed early in the design process they can save us a lot of trouble and code changes later. Getting the ID number is a separate transaction. The other two tables are updated in one transaction.

1st alert: getting the next ID efficiently. 2nd alert: consistency between tables. 3rd alert: doing the transaction efficiently.

Summary of Concerns
The Transaction table is hot: lots of INSERTs. How do we handle ID numbers? Allocation structures in the database.
The Account table must be transactionally consistent with Transaction. Do I trust the developers to do this? We cannot release the lock until BOTH are in sync - and what about the latency of round trips for this? Potentially hot rows in Account: are some accounts touched more than others?
The ATM table has hot rows: each row is on average touched at least ten times per second (e.g. 10**3 rows with 10**4 transactions/sec).

Generating a Unique ID - why won't this work?

CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS

DECLARE @LastTransaction_ID INT

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID

Concurrency is Fun
Two sessions run the same statements at the same time:

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = 13

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = 13

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = 13

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = 13

[Diagram: the ATM row with ID_ATM = 13 has LastTransaction_ID = 42; both sessions read it, so both end up with @LastTransaction_ID = 42 and compute the same "next" ID.]

Generating a Unique ID - The Right Way

CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS

UPDATE ATM
SET LastTransaction_ID = LastTransaction_ID + 1,
    @ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID

And it is simple too...

This method still has the potential to introduce a concurrency issue (this is likely on a narrow row with high page density). Going forward, SQL11 sequence generators are the real solution for maintaining these in the DB. A potential, not so great, trick (if you don't care about space or can periodically clean up): pad the row a bit and crank down fillfactor using PAD_INDEX; that should help reduce contention at all levels of the B-tree.
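For completeness, a usage sketch for the GetID procedure above (the ATM_ID value is illustrative, and the dbo schema is assumed):

DECLARE @NewID INT;
EXEC dbo.GetID @ID = @NewID OUTPUT, @ATM_ID = 13;
SELECT @NewID AS NextTransactionID;   -- read and increment happened in a single UPDATE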

Hot Rows in ATM
Initial runs with a few hundred ATMs show excessive waits for LCK_M_U. Diagnosed in sys.dm_os_wait_stats; drill down to individual locks using sys.dm_tran_locks. Inventive readers may wish to use XEvents (event objects sqlserver.lock_acquired and sqlos.wait_info) and bucketize them. As concurrency increases, lock waits keep increasing while throughput stays constant. Until...
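A sketch of the lock drill-down mentioned above, joining the waiting tasks to sys.dm_tran_locks (filtered to the LCK_M_U waits seen here):

SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       tl.resource_type,
       tl.resource_description,
       tl.request_mode
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_tran_locks AS tl
    ON wt.resource_address = tl.lock_owner_address
WHERE wt.wait_type LIKE 'LCK_M_U%';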

Spinning Around
Diagnosed using sys.dm_os_spinlock_stats (pre-SQL 2008 this was DBCC SQLPERF(spinlockstats)). You can dig deeper using XEvents with the sqlos.spinlock_backoff event. We are spinning for LOCK_HASH.

LOCK_HASH - what is it?
[Diagram: many threads taking LCK_U on a row must pass through the lock manager's hash table, which is protected by the LOCK_HASH spinlock.] Why not go to sleep?

To see whether lock entries in the lock manager buffer can be accessed, you have to protect the memory region, so you have to acquire a spinlock to check whether it is locked. The reason we don't go to sleep is that we would release the quantum, which gets us into the wait queue (off the scheduler), and the L2 and L3 caches are cleared. That's why we spin.

Things to consider mentioning: XEvents and debug symbols.

Note - Spinlock Information: A spinlock is a lightweight user-mode synchronization object used to protect a specific data structure. It implements a busy-wait loop, as it is more efficient to spend clock cycles waiting for the structure to become available than to implement a delay/sleep function. In SQL 2008 R2 the LOCK_HASH spinlock implemented an exponential backoff algorithm; SQL 11, the pending release, implements exponential backoff for all spinlock types, which is more efficient on a highly concurrent system.

Locking at Scale
The ratio between ATM machines and transactions generated was too low. We can only sustain a limited number of locks/unlocks per second; it depends a LOT on NUMA hardware, memory speeds and CPU caches. Each ATM was generating 200 transactions/sec in the test harness. Solution: increase the number of ATM machines.
Key takeaway: if a locked resource is contended, create more of it.
Notice: this is not SQL Server specific - any piece of code will be bound by memory speeds when access to a region must be serialized.

Non-Fully Qualified Calls To Stored Procedures Result in SOS_CACHESTORE Spins
Almost all sessions were waiting on LCK_M_X. This corresponded with high wait time for the compile exclusive lock LCK_M_X (note: ignore SOS_SCHEDULER_YIELD).


Huge increase in spinlocks on SOS_CACHESTORE & SOS_SUSPEND_QUEUE.

This corresponded with high wait time for the compile exclusive lock LCK_M_X (note: ignore SOS_SCHEDULER_YIELD).
Root cause: the Lorenzo user was not a member of the db_owner role; all stored procedures were owned by dbo, e.g. dbo.lorenzoproc.
Workaround: add the application user to the db_owner role.
Long-term change required: change "Exec lorenzoproc" to "Exec dbo.lorenzoproc".

Application Scalability: LCK_M_X COMPILE Lock Contention Due To Non-Fully Qualified Calls To Stored Procedures Results in SOS_CACHESTORE Spins
When we reached ~3000-3500 application users on the system, the total CPU consumption jumped from about 10% to 100%. This was almost a vertical spike and happened in a period of seconds. Examining dm_exec_requests revealed almost all the sessions were waiting on LCK_M_X to compile a stored procedure. These sessions did not make much progress, with the majority of them timing out at the application timeout of 30 seconds.

When the problem occurs we saw a huge increase in spinlocks on SOS_CACHESTORE & SOS_SUSPEND_QUEUE. This corresponded with the high wait time for the compile exclusive lock LCK_M_X.

SOS_SCHEDULER_YIELD: voluntary yielding - when we hit a wait resource, we yield and go back into the runnable queue. Explanation of SOS_SCHEDULER_YIELD: this means that the tasks running are generally exhausting their full quantum, at which time the SOS_SCHEDULER_YIELD is recorded in SQL. SQL uses non-preemptive multi-tasking, meaning tasks voluntarily yield when the quantum has expired. It generally just means a busy system. Whenever a task yields the scheduler to somebody else, it enters a scheduler yield wait. The number you see there is the total amount of time that tasks are waiting in the runnable queue. Note that there is double accounting for this value: if 10 tasks are sitting in the runnable queue for a second, total wait time will be 10 seconds. If you have a lot of workers, then this number could be large. Sometimes reducing the number of workers improves perf, but there is no magical formula to determine it.

LCK_M_X Contention Root Cause: The Lorenzo user was not a member of the db_owner role; all stored procedures were owned by dbo, e.g. dbo.lorenzoproc. The Lorenzo application calls stored procedures using the format "Exec lorenzoproc". In this scenario the Lorenzo user does not own the object. In order to determine whether there is another object called lorenzoproc, SQL acquires an exclusive lock LCK_M_X and prepares to compile the procedure; this includes calculating the object ID, which can then be used to do an exhaustive search of the plan cache to locate a previously compiled plan. Exclusive locks are not compatible with each other, therefore in this scenario a convoy occurred on the exclusive lock, which resulted in the locks and spins detailed above. This contention occurs when many concurrent requests attempt to execute the same procedure and the procedure call does not use a fully qualified naming convention. It can be addressed by either utilizing fully qualified names or making the application user a member of the db_owner role. This behavior is described in this article: http://support.microsoft.com/kb/263889

Hot Rows in Account
Three ways to update the Account table:

1. Let the application servers invoke a transaction to both insert into TRANSACTION and UPDATE Account.
2. Set a trigger on TRANSACTION.
3. Create a stored proc that handles the entire transaction.
Option 1 has two issues: app developers may forget it in some code paths, and the latency of the roundtrip is around 1ms, i.e. no more than 1000 locks/sec possible on a single row. Option 2 is the better choice! Option 3 must be used in all places in the app to be better than option 2.

Hot Latches!
The LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX (high = more than 1ms). What are we contending on? A latch is a lightweight semaphore. Locks are logical (transactional consistency); latches are internal to the SQL engine (memory consistency). Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH.

[Diagram: an 8K page with several rows; two LCK_U holders compete for the same PAGELATCH_EX.]

Row Padding
In the case of the ATM table, our rows are small and few. We can waste a bit of space to get more performance. Solution: pad the rows with a CHAR column so each row takes a full page - then 1 lock = 1 page latch.

ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')

In certain scenarios with shallow B-trees (e.g. the BizTalk spool), row padding can shift the latch to an internal structure (ACCESS_METHODS_HOBT_VIRTUAL_ROOT).

INSERT Throughput
The Transaction table is by far the most active table. Fortunately it is INSERT only, so there is no need to lock rows. But several rows must still fit on a single page, and we cannot pad pages - there are 10**10 rows in the table. A new page will eventually be allocated, but until it is, every insert goes to the same page. Expect PAGELATCH_EX waits - and this is the observation.

Hot page at the end of the B-tree with an increasing index.

Waits & Latches
Dig into details with: sys.dm_os_wait_stats and sys.dm_os_latch_stats

wait_type           % Wait Time
PAGELATCH_SH        86.4%
PAGELATCH_EX        8.2%
LATCH_SH            1.5%
LATCH_EX            1.0%
LOGMGR_QUEUE        0.9%
CHECKPOINT_QUEUE    0.8%
ASYNC_NETWORK_IO    0.8%
WRITELOG            0.4%

latch_class                         wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT    156,818
LOG_MANAGER                         103,316

High-volume SQL Service Broker apps have the latch contention problem too!

LATCH_* waits are diagnosed in deeper detail by going to sys.dm_os_latch_stats.

How to Solve an INSERT Hotspot
Hash partition the table: create multiple B-trees and round-robin between them - more resources and less contention. Or do not use a sequential key: distribute the inserts all over the B-tree.
[Diagram: eight hash buckets (hash IDs 0-7) receiving keys 0,8,16 / 1,9,17 / 2,10,18 / 3,11,19 / 4,12,20 / 5,13,21 / 6,14,22 / 7,15,23; four ranges 0-1000, 1001-2000, 2001-3000, 3001-4000 each receiving their own INSERT stream.]

Schema design can solve the hotspot too, e.g.:

For example, with a large beverage chain we considered changing the index such that the store # (of which there are 8,000) was appended to the front of the CI key. This naturally scales out the inserts, but might be suboptimal for scan-type read patterns.

Design Pattern: Table Hash Partitioning

Create a new filegroup or use an existing one to hold the partitions; equally balance over the LUNs using an optimal layout.
Use the CREATE PARTITION FUNCTION command to partition the tables into #cores partitions.
Use the CREATE PARTITION SCHEME command to bind the partition function to filegroups.
Add a hash column to the table (tinyint or smallint) and calculate a good hash distribution - for example, use HASHBYTES with modulo, or BINARY_CHECKSUM.

But be careful about the extra work that can be introduced from hash partitioning. Opening many partitions has a CPU cost to it. Index seeks will have to touch every single partition (unless the seek includes the hash column)

In other words: Test this both for the loading speed and for query performance

Rule of thumb? Start with the number of partitions = number of cores. Then increase the partitions from there if needed.

European PASS Conference 200983Lab Example: Before Partitioning

Latch waits of approximately 36 ms at baseline of 99 checks/sec.

12 #Lab Example: After Partitioning**Other optimizations were applied

Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.1

2

34 #Other optimizations were applied, but this illustrates the overall reduction in latch contention at high throughput using this technique

85Pick The Right Number of Buckets #B-Tree Root SplitNextPrevVirtual RootSHLATCH

(ACCESS_METHODSHBOT_VIRTUAL_ROOT)LCKPAGELATCHXSHSHPAGELATCHPAGELATCHEXSHSHEXSHEXEXEXEX #Root splits are expensive, although it will only affect one partition at a time. It is when many transactions cause page splits. We are suggesting the partitioning is better. 87Management of LOB DataResolving latch contention required rebuilding indexes into a new filegroupResulted in PFS contention (PAGELATCH_UP):Engine uses proportional fill algorithmMoving indexes from one filegroup to another resulted in imbalance between underlying data files in PRIMARY filegroupResolve: move hot table to dedicated filegroupNeither ALTER TABLE nor any method of index rebuild support the movement of LOB data. Technique used:Create the new filegroup and files. SELECT/INTO from the existing table into a new table. Change the default filegroup as specifying a target filegroup is not supportedINSERT...WITH (TABLOCK) SELECT will have similar behaviour without the need to change default filegroupDrop the original table and rename the newly created table to the original name.As a general best practice we advised the partner/customer to use dedicated filegroups for LOB data Dont use PRIMARY filegroupSee Paul Randal post: http://www.sqlskills.com/BLOGS/PAUL/post/Importance-of-choosing-the-right-LOB-storage-technique.aspx

#Resolving the previously noted latch contention for this customer resulted in rebuilding some of the indexes onto a new filegroup. Contention on PFS pages surfaced in the PRIMARY filegroup as there was now an imbalance in free space between the underlying data files which impacts the engines proportional fill algorithms. The contention on PFS was related to allocations to the table which stored the vast majority of the LOB data and comprised 38% of the total data size for the entire database. This surfaced as waits on PAGELATCH_UP in the range of 30ms. In addition, some PAGELATCH_EX waits were also observed on this table.

Technique highlighted here:Not ideal for a production deployment as it 1) requires a period of time when the table is offline and 2) one would have to track changes after the SELECT/INTO and handle those as part of the switch which would directly impact #1 (the amount of time the table is offline).

Note: SELECT/INTO is needed in 2005, however in 2008 you can use INSERT...WITH (TABLOCK) SELECT to achieve same effect

88This does not make sense? Was there an existing imbalance?Was this really PAGELATCH_EX or was it PAGELATCH_UPNUMA and What to doRemember those PAGELATCH for UPDATE statements?Our solution: add more pagesImprovemnet: Get out of the PAGELATCH fast so next one can work on it

On NUMA systems, going to a foreign memory node takes at least 4-10 times more expensiveUse SysInternals CoreInfo tool

#Add a link to Slavas NUMA rebalancing recommendations.

89How does NUMA work in SQL Server?The first NUMA node to request a page will own that pageOwnership continues until page is evicted from buffer poolEvery other NUMA node that need that page will have to do foreign memory accessAdditional (SQL 2008) feature is SuperLatchUseful when page is read a lot but written rarelyOnly kicks in on 32 cores or moreThe this page is latched information is copied to all NUMA nodesAcquiring a PAGELATCH_SH only requires local NUMA accessBut: Acquiring PAGELATCH_EX must signal all NUMA nodesPerfmon object: MSSQL:LatchesNumber of SuperLatchesSuperLatch demotions / secSuperLatch promotions / secSee CSS blog post #http://blogs.msdn.com/psssql/archive/2009/01/28/hot-it-works-sql-server-superlatch-ing-sub-latches.aspxhttp://blogs.msdn.com/psssql/archive/2008/01/24/how-it-works-sql-server-2005-numa-basics.aspx

90

NUMA 3NUMA 2NUMA 1NUMA 0Effect of UPDATE on NUMA traffic0123ATM_IDUPDATE ATMSET LastTransaction_ID UPDATE ATMSET LastTransaction_ID UPDATE ATMSET LastTransaction_ID UPDATE ATMSET LastTransaction_ID

App Servers #NUMA 3NUMA 2NUMA 1NUMA 0Using NUMA affinity0123ATM_IDUPDATE ATMSET LastTransaction_ID UPDATE ATMSET LastTransaction_ID UPDATE ATMSET LastTransaction_ID UPDATE ATMSET LastTransaction_ID

Port: 8000Port: 8001Port: 8002Port: 8003How to: Map TCP/IP Ports to NUMA Nodes #When setting NUMA affinity you can direct each ATM machine to a specific NUMA node. Each ATM has its own LastTransaction_ID.92Final Results and thoughts120.000 Batch Requests / sec100.000 SQL Transactions / sec50.000 SQL Write Transactions / sec

12.500 Business Transactions / sec

CPU Load: 34 CPU cores busyGiven more time, we would get the CPUs to 100%, Tune the NICs more, and work on balancing NUMA more. And of NIC, we only had two and they were loading two CPU at 100%

#Given more time, we would get the CPUs to 100%, Tuned the NICs more, worked on balancing NUMA more. 93Case Study: Online GamingApplication: Online sports betting, poker and casino play. Performance:15 million page views, 980,000 users per day Over 30 thousand database transactions per second, 500+ billion per day450,000 SQL Statements/sec on single databaseWorkload Characteristics:Multiple systems comprise the gaming experience including payment, casino games, sportsbook, etcRequire very low latency and must meet high transaction volumes - based on number of users on systemOver 100 SQL Server instances and 1,400 databases in architectureHardware/Deployment Configuration:Scale-up the payment system. HP Superdome (32-socket, 2 core; 256GB). Investigating x64Co-operative Scale-out for actual gaming activityThanks Mike. Ok, to provide some further context I will walk us through another two Extreme OLTP case studies. I think you will see a lot of similarities in terms of challenges relating to what Mike has deep-dived on and well mention a few other unique challenges. A few other interesting things of note as we go through these studies are some of the common trends exhibited, whether scale-up or scale-out architectures are implemented.

Finally, both these customers, while successful on the platform and hitting some phenomenal numbers in production, are still running into limitations in terms of throughput and latency for where they want to go with their applications, and that will lead us to our OLTP futures discussion.

94 Case Study: Online Gaming (cont.)
Other solution requirements:
Failure is not an option: zero data loss, and achieved 99.998% availability
Use database mirroring and log shipping across datacenters to achieve HA/DR goals
http://sqlcat.com/whitepapers/archive/2010/06/07/proven-sql-server-architectures-for-high-availability-and-disaster-recovery.aspx
http://sqlcat.com/whitepapers/archive/2010/11/03/failure-is-not-an-option-zero-data-loss-and-high-availability.aspx
Use SQL Server replication for reporting
Observations:
Large scale of users, with low latency requirements
Hot spots on heavily hit tables - page latching
Scale-out helped increase transaction volume (#/sec)
WRITELOG performance critical for transaction latency

95 Online Gaming Infrastructure (no HA, DR & Backup shown)
[Diagram: the SQL Server estate with instance counts per area - User Account & Sportsbook 8+, Bookmaking 2+, Betcache 4+, Casino 2+, VS Games 2+, 1x2Games 12+, CSM 2+, CMS 15+, NewsLetter 2+, SMS 4+, Repl Other 40+, BGI DWH 60+, DWH Stage 50+, Payment 20+, Monitoring 10+, Administration 20+, ASP.NET Sessions 8+, OLAP 10+, Internal Office/SharePoint (300+), plus further "Other" groups of 30+ and 20+.]
This is not a WinMo7, it's a SuperDome.

96 Technical Challenges and Architecting for Extreme OLTP

Challenge: Network
Consideration/Workaround: CPU bottlenecks for network processing were observed and resolved via network tuning (RSS). Dedicated networks for backup, replication, etc. Eight network cards for clients.

Challenge: Concurrency
Consideration/Workaround: Latch contention on heavily used tables (last-page insert). The hash partition option caused other problems in query performance and application design (see the hash-partitioning sketch after this table). Resolution: co-operative scale-out.

Challenge: Transaction log
Consideration/Workaround: Latency on log writes. Resolution: increased throughput and decreased latency by placing the transaction log on SSDs. Database mirroring overhead is very significant when synchronous.

Challenge: Database and table design/schema
Consideration/Workaround: Latency on IO-intensive data files (including tempdb). Resolution: session state database on SSDs. Resolution: betting slips/customer databases testing sharding - single server, single database: 500 tx/sec; single server, 4 databases: 1,800 tx/sec (sharding); multiple servers: 2,600 tx/sec (sharding).

Challenge: Monitoring
Consideration/Workaround: Security monitoring (PCI and intrusion detection) adds between 10% and 25% impact/overhead when monitoring.

Challenge: Architecture/Hardware
Consideration/Workaround: Tests using x64 (8-socket, 8-core) vs. Itanium Superdome (32-socket, dual-core) showed the same transaction throughput; IO and backups were a bit slower.
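For reference, a minimal sketch of the hash-partitioning approach that was evaluated for the last-page insert hotspot (table, column, and bucket count are illustrative). It spreads concurrent inserts across several insertion points in the clustered index, but any query that does not supply the hash bucket has to touch every partition - the query-performance and application-design problem noted above:

-- 16 hash buckets derived from the identity value
CREATE PARTITION FUNCTION pf_hash16 (TINYINT)
    AS RANGE LEFT FOR VALUES (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14);
CREATE PARTITION SCHEME ps_hash16
    AS PARTITION pf_hash16 ALL TO ([PRIMARY]);

CREATE TABLE dbo.BetSlip
(
    BetSlipID   BIGINT IDENTITY(1,1) NOT NULL,
    CustomerID  INT       NOT NULL,
    PlacedAt    DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    HashBucket  AS CAST(BetSlipID % 16 AS TINYINT) PERSISTED NOT NULL,
    CONSTRAINT PK_BetSlip PRIMARY KEY CLUSTERED (HashBucket, BetSlipID)
        ON ps_hash16 (HashBucket)
);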

32 network cards in total; 1 Gb/s network cards; 10 Gb switch; dedicated network switches.

Enable RSS so that multiple CPUs can process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx. The next step was to disable the Base Filtering Engine service in Windows and explicitly enable TCP Chimney offload.
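On Windows Server 2008 / 2008 R2 these settings are typically toggled as follows (run from an elevated prompt; actual offload support depends on the NIC and driver, and disabling the Base Filtering Engine service is done separately through the Services console):

REM Enable Receive Side Scaling so receive processing is spread across CPUs
netsh int tcp set global rss=enabled
REM Explicitly enable TCP Chimney offload
netsh int tcp set global chimney=enabled
REM Verify the resulting global TCP settings
netsh int tcp show global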

Concurrency: latch contention moved to the secondary index when the table used hash partitioning. Co-operative scale-out, then Hekaton.

DB mirroring: ~1 ms overhead, but at high transaction volume a significant impact. Log info: ~1 ms in perfmon, but in the SQL DMVs we still saw WRITELOG as a high wait type.

Betting slips partitioned by region and user. The app routes each request. Customer key stored in the bet slips.

Tripwire based on Xevents

IO bandwidth slower; fewer slots on x64 vs. Itanium for IO and network cards (not big enough).

DB mirroring: ~1 ms, but significant for them as it increases with workload. Log info: ~1 ms in perfmon, but in the SQL DMVs we still saw WRITELOG as a high wait type. CPU, memory, and IO maxed out. Scaling to multiple servers was not enough (latch contention), so they went with 12 shards. Their monitoring is via XEvents.

Open questions: info on when to place log files on SSDs? Why does IO on the data files perform better on multiple servers vs. a single server?
97 Case Study: Financial Stock Market
Application: real-time, high-transaction, low-latency stock quoting.
Performance:
Over 280,000 business transactions/sec
Over 1 million data manipulation calls per second in a single database
Latency per business transaction under 1 millisecond; real-time nature of the data flow
Workload characteristics:
Send a large batch containing multiple business transactions, parse it, and end up inserting all records into a large table which is constantly read
Under 20 tables in the application; nothing generic about the code/solution
Hardware/deployment configuration:
Load distributed based on an alphabetical split; co-operative scale-out
Commodity hardware (2-socket, quad-core, pre-Nehalem) and a high-performance SAN
98 Case Study: Financial Stock Market (cont.)
Other solution requirements:
Mission critical to the business in terms of performance and availability; requires 99.999% uptime overall and 100% during the business day.

Treat the system like their mainframe operations.
Utilize SQL Server HA features to help support the five-nines uptime requirement and geographical redundancy:
SQL Server failover clustering for local (within-datacenter) availability
Database mirroring (high availability/async) for geo-availability; locations around 300 miles apart
30 MB/sec log generation with no send queue

Observation: extremely low latency and high throughput requirements with machine-born data led to hitting a number of the same bottlenecks we observed more commonly in the scale-up scenarios.

The 23,400 seconds of the trading day must have 100% uptime.
99 Stock Market Architecture (1)

100 Stock Market Architecture (2)

101 Technical Challenges and Architecting for Extreme OLTP

Challenge: Network
Consideration/Workaround: Network round-trip time for synchronous calls from the client induced latency. Resolution: batch data into a single large parameter (varchar(8000)) to avoid network round trips.

Challenge: Concurrency
Consideration/Workaround:
Page latch contention, small table: latching on 36 rows on a single page. Resolution: pad the rows to spread the latching across multiple pages; performance gain: 20%.
Page latch contention, large table: concurrent INSERTs into an incremental (identity) column, last-page insert. Resolution: clustered index on the (partition_id, identity) columns (see the sketch after this table); performance gain: 30%.
Heavy, long-running threads contending for time on the scheduler. Resolution: map TCP/IP ports to NUMA nodes (http://msdn.microsoft.com/en-us/library/ms345346.aspx); performance gain: 20%.

Challenge: Transaction log
Consideration/Workaround: Log waits. Resolution: batch business transactions within a single COMMIT to avoid WRITELOG waits (also shown in the sketch after this table). A test of SSDs for the log helped with latency.

Challenge: Database and table design/schema
Consideration/Workaround: Change decimal datatypes to money, and others to int; integer-based datatypes go through an optimized code path; performance gain: 10%. No referential integrity, as this has an overhead on performance; it is enforced in the application.

Challenge: Monitoring
Consideration/Workaround: 5% overhead from running the default trace alone. Collect perfmon and targeted DMV/XEvents output to a repository.

Challenge: Architecture/Hardware
Consideration/Workaround: x32 vs. x64 performance: x32 was 12% faster; the application is not memory constrained. **Interesting for the futures discussion later in the presentation.
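A minimal sketch of the two fixes referenced above - clustering on (partition_id, identity) so concurrent inserts land on several "last pages" instead of one, and batching a whole set of messages under a single COMMIT so the log flushes once per batch. Table, column, and procedure names are illustrative, as is the semicolon/comma message format:

CREATE TABLE dbo.QuoteMessage
(
    partition_id  TINYINT   NOT NULL,                 -- e.g. assigned per connection/line
    QuoteID       BIGINT    IDENTITY(1,1) NOT NULL,
    Symbol        CHAR(8)   NOT NULL,
    Price         MONEY     NOT NULL,
    StatusID      TINYINT   NOT NULL DEFAULT 1,
    ReceivedAt    DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_QuoteMessage PRIMARY KEY CLUSTERED (partition_id, QuoteID)
);
GO

CREATE PROCEDURE dbo.InsertQuoteBatch
    @Batch VARCHAR(8000)       -- e.g. 'MSFT,28.10;INTC,21.05;CSCO,17.33;'
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @pos INT, @item VARCHAR(100);

    BEGIN TRAN;                -- one commit (one log flush) for the whole batch

    WHILE CHARINDEX(';', @Batch) > 0
    BEGIN
        SET @pos   = CHARINDEX(';', @Batch);
        SET @item  = LEFT(@Batch, @pos - 1);
        SET @Batch = SUBSTRING(@Batch, @pos + 1, 8000);

        INSERT INTO dbo.QuoteMessage (partition_id, Symbol, Price)
        VALUES (@@SPID % 16,                                            -- spread the insert point
                LEFT(@item, CHARINDEX(',', @item) - 1),
                CAST(SUBSTRING(@item, CHARINDEX(',', @item) + 1, 100) AS MONEY));
    END

    COMMIT;
END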

- Discuss gains from the mainframe until now. The mainframe was looking at 50,000 transactions/sec with 100 ms latency.
- SQL Server started around 16,000 per box.
- A test of SSDs for the log helped with latency, but didn't solve HA & DR. Mirroring is also a heavy price to pay.
- Resolving latch contention increased throughput.
- Reorganized columns based on access: ~5% improvement (maybe).
- Due to the heavy cost of passing through the optimizer, several lookup tables were converted into CASE statements. Twelve tables were consolidated into three, making for a rather unorthodox data model.
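For illustration, the lookup-table-to-CASE conversion might look like this (the dbo.QuoteStatusLookup table, the StatusID values, and the status names are hypothetical):

-- Before: a join to a small lookup table on every call
SELECT q.QuoteID, s.StatusName
FROM   dbo.QuoteMessage AS q
JOIN   dbo.QuoteStatusLookup AS s ON s.StatusID = q.StatusID;

-- After: the lookup is inlined as a CASE expression, avoiding the join
SELECT q.QuoteID,
       CASE q.StatusID
            WHEN 1 THEN 'New'
            WHEN 2 THEN 'Processed'
            WHEN 3 THEN 'Rejected'
            ELSE 'Unknown'
       END AS StatusName
FROM   dbo.QuoteMessage AS q;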

102
[Architecture diagram: box labels from the slide - QuoteFrontEnd and TradeFrontEnd servers (one per line, ~150 apps upstream), QuoteBackend and TradeBackend servers (one per vendor feed), databases Quote1, Quote2, Quote4, Quote6 and Trade1, Trade2, Trade4, Trade6, issue ranges A-B, C-D, J-N, and S-Z, Quote Lines, Trade Lines, Vendor feed, and a 10-node SQL Server database cluster.]
Legend:
Quote or Trade: each frontend server (one for each line) receives Quote or Trade messages from a single line, handles the line interface, and calls stored procedures on the SQL Server database cluster to perform message processing.
Stored procedures apply business logic and produce vendor feed messages and return messages.
Vendor feed message table: one database per vendor line, one row per message.
Return message table: one table per line, one row per message.
Each frontend scans a Return table for new messages to send back through the system.
Each backend server (one for each vendor feed) scans a MsgOut table for new messages to disseminate on a vendor line.

