Rob Girard, Principal Technical Marketing Engineer
Shawn Meyers, SQL Server Principal Architect
PBO3350BUS
#VMworld #PBO3350BUS
Snapshots and SQL Server - Technical Deep Dive & Detailed Lab Findings
VMworld 2017 Content: Not for publication or distribution
Rob Girard
Sr. Technical Marketing Engineer
PBO3350BUS:
Snapshots and SQL Server - Technical Deep Dive & Detailed Lab Findings
Shawn Meyers
SQL Server Principal Architect
About Shawn
© 2017 Tintri, Inc. All Rights Reserved.
Shawn Meyers
• SQL Server Principal Architect, practice lead
• Experience in VMware, Microsoft, SQL Server, storage infrastructure, performance tuning.
• Working in IT since 1992, SQL Server since 1996, VMware since 2009
@1dizzygoose linkedin.com/in/shawnmeyers42
About Rob
Rob Girard
• Principal Technical Marketing Engineer @ Tintri as of Jan, 2014
• Working in IT since 1997 with >10 years of VMware experience
• vExpert, VCAP4/5-DCA, VCAP4-DCD, VCP2/4/5, MCSE, CCNA, and TCSE
@robgirard www.linkedin.com/in/robgirard
Introduction
Met at SQL Elite Workshop, hosted by VMware and Tintri [April 2015]
Partnered to share expertise with different aspects of virtualization
Delivered VAP6433 Group Discussion session @ VMworld 2015
This session summarizes the research & lab behind that session
For those who want a closer look under the covers
Scope of Session & Testing
Scope-creep was winning….
The Great Debate – Application vs Crash
• Safe?
• Not Safe!
• Not Safe?
• Safe!
• …..
Agenda
• Explain types of snapshots
• SQL Server backup and recovery
• Factors Impacting Snapshots
• Testing setup
• Testing Results
• Recommendations
Crash consistent vs application consistent
Crash Consistent
• Same concept as if pulling the power plug out of the back of the server
• SQL Server recovery can take longer depending upon what the server was doing
when the crash occurred
Application Consistent
• SQL Server will be in the same state as an OS reboot
• SQL Server startup will be the same every time
• Flushes completed transactions
Snapshot 101: Anatomy of a Snapshot
• Think of a snapshot as layers within Photoshop
• Let’s use a quick visual aid to get us started…
Snapshot Definition
Warning!!! – A snapshot is not a backup
• A snapshot is a point-in-time copy of the data that represents an image
• Can be used to recover anything from individual items to a full server, and everything in between
• At its most basic, a snapshot is a metadata file and a collection of pointer records for a point in time
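To make the "metadata file plus pointer records" idea concrete, here is a minimal toy sketch in Python. It is not how VMware or Tintri implement snapshots; the class `ToyVolume` and its methods are hypothetical names invented for illustration. The only point is that taking a snapshot copies pointers, not data.

```python
class ToyVolume:
    """Toy block store where a snapshot is just a frozen block-pointer map."""

    def __init__(self):
        self.blocks = {}      # block id -> payload (never overwritten in place)
        self.live = {}        # logical address -> block id (live pointer map)
        self.snapshots = {}   # snapshot name -> frozen copy of a pointer map
        self._next_id = 0

    def write(self, address, payload):
        # New data always lands in a fresh block, so any snapshot that still
        # points at the old block keeps seeing the old contents.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = payload
        self.live[address] = block_id

    def snapshot(self, name):
        # Only the pointer records are copied; no payload is duplicated.
        self.snapshots[name] = dict(self.live)

    def read(self, address, snapshot=None):
        pointers = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[pointers[address]]


vol = ToyVolume()
vol.write(0, "page v1")
vol.snapshot("before-patch")
vol.write(0, "page v2")
print(vol.read(0))                   # live view
print(vol.read(0, "before-patch"))   # point-in-time view
```

The snapshot lives in the same block store as the live data, which is one way to see why a snapshot on its own is not a backup: lose the underlying blocks and both views are gone.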
Snapshots – What can I use them for?
Their original intended purpose: Recovery
Patching
Cloning for various reasons:
• New server with a similar function
• Test/Dev/Staging Environments
• Non-Intrusive Recovery Verification
• Troubleshooting
• Moving Data
VMware snapshots
• Done at the hypervisor level
• Includes the state and data of a virtual machine (if memory is snapped)
• Can have snapshot chains
• Can have snapshot consolidation, orphan snapshots, removal issues
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180
Microsoft SQL Server Support Statement on Snapshots
SQL Server supports virtualization-aware backup solutions that use VSS (volume snapshots). For example, SQL Server supports Hyper-V backup.
Virtual machine snapshots that do not use VSS volume snapshots are not supported by SQL Server. Any snapshot technology that does a behind-the-scenes save of a VM’s point-in-time memory, disk, and device state without interacting with applications on the guest using VSS may leave SQL Server in an inconsistent state.
https://support.microsoft.com/en-us/kb/956893
Short version: you need to quiesce for support from Microsoft
Storage array based snapshots
• Known as “Crash-Consistent” snapshots within Tintri VMstore
• Don’t preserve VM state
• Chaining is efficient, automatic, and user-proof
• Very quick and imperceptible to the applications/users
• Superior to a LUN snapshot, as it captures only the VM you want, not the churn of everything else in the LUN
To quiesce or not to quiesce? That is the question!
Windows servers use VSS
Process for bringing the OS into alignment for a proper backup
Flushes dirty pages (buffers) to disk
Primarily used for backups not for rollback snapshots
Can be used to call custom scripts – Freeze & Thaw
Performance & other implications? – Wait for test results!
Extremely minor risk of snapshot having database corruption
Sample Freeze / Thaw script
A sample .bat can be placed in C:\Program Files\VMware\VMware Tools\backupScripts.d
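The script shown on the original slide did not survive extraction. As a rough stand-in, the Python sketch below shows the kind of dispatch logic a freeze/thaw script typically wraps. It assumes, to the best of our understanding of VMware's documentation, that VMware Tools invokes each script in backupScripts.d with a single argument of `freeze`, `thaw`, or `freezeFail`; verify this against your VMware Tools version. The function name `describe_phase` and the log-line format are hypothetical, and a real deployment would still need a .bat (or other executable) entry point to call this.

```python
from datetime import datetime

# Assumption: the script receives exactly one argument naming the quiesce phase.
KNOWN_PHASES = ("freeze", "thaw", "freezeFail")


def describe_phase(phase, now=None):
    """Return a timestamped log line for a phase; reject anything unexpected."""
    if phase not in KNOWN_PHASES:
        raise ValueError(f"unexpected phase: {phase!r}")
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} quiesce phase={phase}"


# Example: what the script would record just before the guest is quiesced.
print(describe_phase("freeze"))
```

Logging both phases with timestamps gives a simple guest-side measurement of how long each quiesce window lasted.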
Testing Environment 1
• Hypervisor: vSphere 6.0U2
• Physical Servers: Cisco UCS B200 Blades (12-core Intel E5-2697’s)
• Storage: Tintri T5080 VMstore
• Virtual Machine Config:
  • Windows Server 2012R2
  • Microsoft SQL 2016
  • 6 vCPU + 16 GB RAM
  • 7+ vDisks
• Client VMs: Windows 2012R2 w/ HammerDB
Testing methodology
HammerDB to populate databases and generate load
Take a variety of snapshots (vSphere & storage)
Observe and Measure:
• Impact to VM during snapshot process
• Recovery impact
Recovery Assessment
• Clone snapshots into VMs & power up into isolation (disconnect NIC)
• Review logs. Recovery times use time stamps from “Starting Master DB” until “Recovery Complete”
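The timing rule above (first "start" log entry to first later "complete" entry) can be sketched in a few lines of Python. The sample log lines are hypothetical, and so is the assumed layout of a 22-character timestamp at the start of each line. The marker strings are the labels used on the slide; the exact wording in a real SQL Server error log may differ, so treat them as parameters.

```python
from datetime import datetime


def recovery_seconds(log_lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker."""
    start = None
    for line in log_lines:
        # Assumed layout: "YYYY-MM-DD HH:MM:SS.ff <source> <message>"
        stamp = datetime.strptime(line[:22], "%Y-%m-%d %H:%M:%S.%f")
        if start is None and start_marker in line:
            start = stamp
        elif start is not None and end_marker in line:
            return (stamp - start).total_seconds()
    raise ValueError("markers not found in order")


# Hypothetical log excerpt, for illustration only.
sample = [
    "2017-01-01 10:00:00.10 spid9s Starting Master DB",
    "2017-01-01 10:00:05.60 spid9s some other message",
    "2017-01-01 10:00:11.10 spid9s Recovery Complete",
]
print(recovery_seconds(sample, "Starting Master DB", "Recovery Complete"))
```

Using the same two markers for every clone keeps the comparison between VSS and crash-consistent snapshots like for like.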
Testing methodology – Taking snaps
Testing methodology – Cloning Snaps to Clones
Testing methodology: Start to End Timing
Test results assumptions
• We try to go in with a clean slate, but we all have ideas beforehand about what the results are going to be
• These assumptions can prejudge the results
• Sharing so you know where we are coming from and you can judge us
• These were written down before testing began
The assumptions
• Quiesce will take longer but will provide for better SQL Server recovery
• VAAI snapshots will be faster
• Managing VM Stun time is key
• Memory snapshots are not needed; they offer no added value
Tests
01 Fill ‘er Up! – Creating databases
02 Long time, no see! – Time since the last quiesced snap
03 On, and on, and on…. The never-ending transaction!
04 vsstrace.exe
Test #1 – Fill ‘er Up!
What are we testing, and under what conditions?
• Crash-consistent vs Quiesced
• Does the size of the DB make a difference?
• Do multiple databases on the same disk make a difference?
• What about a dataspace spanning many disks?
• Snapping the SQL VM under varying levels of stress (up to & including 100% CPU)
The Setup:
• 1 SQL 2016 VM w/ 11 vDisks: OS, Page, Data1, Logs1, TempDB, TempLogs, Backup, Data2, Logs2, Data3, Logs3
• 9 x HammerDB Client VMs to create databases of various sizes
• Primarily Write operations
Test #1 – Fill ‘er Up! [DURING]
Observations:
• No errors observed on the HammerDB clients
• Screenshot of SQL Logs at time of quiesce….
Test #1 – Fill ‘er Up! [RECOVERY]
Observations:
• Review of SQL Logs… Screenshots
1 second recovery observed
Observations:
• Disk I/O observed (VSS vs Crash-Consistent)
• Not much to report, but the crash-consistent snap incurred some I/O not seen with VSS
Test #1 – Fill ‘er Up! [CONCLUSION]
• No stuns observed for Crash-Consistent at time of snap
• VSS stuns in these tests were minimal
• Recovery:
  • 1 second for VSS
  • 11 seconds for crash-consistent
What about my memory? …..I forget
• You can snap the memory of the VM within a VMware snapshot (enabled by default)
VMware Snaps w/ Memory
• Makes for longer stun times
• Can’t natively clone the snapped state, BUT you can use storage-based snapshots to create a clone, and then revert. Handy to troubleshoot a condition that would otherwise clear with a reboot.
• Not typically needed for backups, but needed for recovering a VM to an exact state, including in-flight transactions.
• “Revert” snapshot rolls back VM into a running state
VMware Snaps w/ Memory – cont’d
• Takes longer to create & requires Disk I/O: all provisioned VM memory needs to be dumped to disk
Snapshot consolidation
Agent-based backups tend to take snaps, attach them to a different guest for backups, then revert the snapshot back
When errors occur, you have chain issues
Snapshot consolidation can be a smooth process or can take forever
Many factors
Recovery time
• Restore of a large database can take a long time with native backups
• Snaps can be back online in minutes if not seconds
• Snaps can be mounted to recover individual objects
Snapping SQL Server drives
All data and log drives need to be snapped at the same instant
• Not an issue with Tintri Snaps!
Many times data and log are on different datastores due to different IO patterns or to create multiple queues on the vSphere host
When data and log are on different datastores, the snapshots must keep these consistent or there can be database corruption
Not all SAN vendors offer a way to snap multiple datastores at the same instant
Test #2 – Long Time, No See!
• Does the time elapsed since the last VSS-quiesced snapshot make an impact on crash-consistent snaps, or future VSS snaps? (i.e. database growth)
• 12 hours since the last VSS snapshot
Test #2 – Long Time, No See! [DURING]
• No noticeable changes observed at time of snap compared to 12+ hours earlier, when databases first began populating
Test #2 – Long Time, No See! [RECOVERY]
• Recovery of Crash-Consistent snapshot, 12 hours earlier: 11 seconds
• Inspect SQL logs from crash-consistent snap, 12 hours since last VSS operation
• Compare to SQL log of VSS-quiesced snap…
• 13 seconds crash-consistent versus 11 seconds crash-consistent 12 hours earlier (i.e. within <15 mins of last VSS-quiesced snap)
• 13 seconds (crash-consistent) vs 1 second VSS
• Let’s compare disk activity…
Test #2 – Long Time, No See! [CONCLUSION]
• The longer you go between backups/VSS, the more work that is required at recovery time
• Recovery time difference is minimal assuming decent performance
SQL Server backup and recovery
Snapshots are not backups!!!
• SQL Server native backup, when structured properly, allows point-in-time recovery to the nearest transaction or sub-second
• Only completed transactions will be part of the backup
• Recovering from a snapshot will either be backup consistent or crash consistent
• Many SQL Server experts warn about all snapshots
SQL Server Native Backups
DBA has the control they desire to manage their risk
Using a mixture of both snapshots and backups provides the most flexibility
• Think about them as multiple layers of protection; one does not replace the other
• Depends upon SLA (business rules)
Usually stored on separate disk subsystem
Storage corruption will not cause data loss
Recovery models
SQL Server has three recovery models
• Simple – Can only be restored to the last full (or differential) backup
• Bulk logged – Can be restored to point in time, but bulk logged processes are not in the restore and have to be repeated.
• Full – Can recover to any point in time, down to a single transaction or even a certain millisecond
Set your phasers for VM stun time
Snapshots can pause a virtual machine in order to quiesce
Have seen hourly 5 minute stuns of IO (bad config)
Proper setup can make these stuns manageable
Some SQL Server databases can never be stunned
Even worse in a LUN-based datastore where ALL VMs need to be stunned… prolonging the pain
Test #3 – The Never-ending Transaction
Image Credit: neverendingstory.com
Test #3 – The Never-ending transaction
• What happens during quiesce if there’s no clean break in active I/O?
• Start with an ugly query that never commits
• Run it against a decent-sized database.
• We used a 5,000 warehouse HammerDB database @ 500 GB
• Sit back and wait. And wait. And wait….
• While query is running, take snaps after 20 mins, 1.5 hours and ~21 hours (7.5 million rows affected)
• Finally terminated the process after 17+ million rows were affected, 2.5 days later
Test #3 – The Never-ending transaction [During]
Increased snapshot times (LONG!)
• VMware snapshot: 02:40 (compared to ~40 seconds)
• Removal: 34:43 (compared to ~20 second removals)
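For a sense of scale, the figures above can be turned into rough slowdown factors. This small Python sketch assumes the slide's times are in mm:ss and takes the "~40 seconds" and "~20 second" baselines at face value, so the results are approximate.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Baselines are the approximate figures quoted on the slide.
snap_ratio = to_seconds("02:40") / 40      # snapshot creation vs ~40 s
removal_ratio = to_seconds("34:43") / 20   # snapshot removal vs ~20 s

print(f"snapshot creation: ~{snap_ratio:.0f}x slower")
print(f"snapshot removal:  ~{removal_ratio:.0f}x slower")
```

Under these assumptions creation is about 4x slower, while removal is roughly two orders of magnitude slower, which is where the long-running transaction hurts most.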
• SQL log during snap removal – “I/O requests taking longer than 15s”
Test #3 – The Never-ending transaction [Recovery]
• Both the VSS-quiesced snapshot AND the crash-consistent snap entered recovery mode after boot up
• SQL database was not ready, and relatively high I/O was observed on storage
• Analysis phase took 290x (!!!) LONGER on crash-consistent snap
• Advantage: VSS?
• Not quite… VSS = 7ms, crash-consistent = 2037ms
• Only 2 seconds added to recovery time…
• …out of 16+ minute recoveries at 30K – 40K IOPS
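The arithmetic behind that punchline, as a short Python check. The 16-minute figure is used as a lower bound for total recovery time, so the computed share is an upper bound.

```python
vss_ms = 7          # analysis phase, VSS-quiesced snapshot
crash_ms = 2037     # analysis phase, crash-consistent snapshot

ratio = crash_ms / vss_ms                 # relative difference
extra_s = (crash_ms - vss_ms) / 1000      # absolute difference in seconds
recovery_s = 16 * 60                      # "16+ minutes", taken as a floor
share = extra_s / recovery_s              # fraction of total recovery time

print(f"ratio: ~{ratio:.0f}x")
print(f"extra time: {extra_s:.2f} s")
print(f"share of recovery: {share:.2%}")
```

A ratio near 290x sounds dramatic, but the absolute cost is about two seconds, a fraction of a percent of a recovery that takes over 16 minutes either way.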
Test #4 – vsstrace
• Available in the Windows SDK
• Information overload!
• ~48,600 events for “idle” SQL Server w/ 9 databases on 11 vDisks
• 37 seconds logged
• ~35,500 events for an idle SQL Server with NO custom DBs
• 35.5 seconds logged
• What to use it for? Deep inspection
• Beware the observer effect – Debugging isn’t free
• What can be found?
Test #4 – vsstrace – Interesting Findings – DB names
Test #4 – vsstrace – As it relates to SQL
• SQL-specific references use the SQLServerWriter by default
• Filter by the “WRITER” module to narrow down logging results (~10% of events in our sample: 4,841 / 48,647 total events)
• vssadmin list writers
• Many writers may be listed; vsstrace will tell you which is actually being used
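A minimal Python sketch of that filtering step follows. The trace lines are hypothetical and do not reproduce the real vsstrace output format; the sketch simply assumes the module name appears as plain text somewhere in each line. The event counts are the ones quoted above.

```python
def filter_by_module(lines, module="WRITER"):
    """Keep only trace lines that mention the given module name."""
    return [line for line in lines if module in line]


# Hypothetical trace lines, for illustration only.
sample = [
    "[COORD] snapshot set created",
    "[WRITER] SqlServerWriter: OnFreeze",
    "[SWPRV] volume flush",
    "[WRITER] SqlServerWriter: OnThaw",
]
print(filter_by_module(sample))

# Share of WRITER events in the sample from the slide.
share = 4841 / 48647
print(f"{share:.1%} of events")
```

Cutting the trace to roughly a tenth of its size makes it practical to read the SQL writer's freeze and thaw sequence by eye.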
Ask Again: To quiesce or not to quiesce?
Summary of results on whether it’s worthwhile or not
• Data loss (transactions rolled back)
• Supportability
• Stun Times
• Recovery from snapshots
Recommendations
• Snapshots do not replace backup but are another tool for improved recovery
• Snapshots offer many great benefits beyond just recovery
• Quiesce if supportability is key – Suggest frequent crash-consistent with occasional VM-consistent or VSS-enabled backup job
• It is a business tradeoff; snapshots are for every situation