A chicken in every pot: a persistent snapshot memory scaled in time
Liuba Shrira and Hao Xu
Brandeis University
Storage systems: the 7-year itch
1984: rotational delay – FFS
1991: large memory - LFS
1998: cheaper disk - Elephant
2005: .. a chicken in every pot:
a snapshot box on the side..
Trends
Hardware: Disk
Cheap ($1/GB) and getting cheaper
Software industry: Forbes (12/2004) says:
the need for keeping past state is growing
Trends cont.
- A casino chases a card counter
- IT dept. chased by Sarbanes-Oxley
- Hippocratic DB audited about patient privacy preservation
Need to analyze past activity
SNAP: a snapshot system for an object storage system
Goal:
Storage system capability for
back-in-time execution (BITE):
application runs against
read-only snapshots without synchronization
analysis in retrospect
Baseline Requirements for BITE
Consistent snapshots: same (old) invariants hold
BITE of general code: after-the-fact ad-hoc analysis (vs predefined SQL access methods)
App chooses the snapshot: snapshot state meaningful to the app (vs “some time in the past”)
High time “resolution”: fine-grained past analysis (vs backup for recovery)
Over long time-scales..
Living with the past: how close?
today: too close (temporal DBs, CVFS) or too far (warehouse, e.g. Netezza)
Snapshots can be of long-term importance or transient; today: uniform treatment, apps cannot discriminate
Inherent tension: latency of access vs cost of representation (space and time); today: limited adaptation (compress or not)
Capturing past states
Two ways:
Cheap: no-overwrite update, past stays put, copy the new:
less to write, but a bloated DB, and the past inherits the same rep
Opportunistic: in-place update, past is copied out, separated:
more to write, but can write smartly, can tailor the past rep, and the DB stays clustered
Our requirements:
Non-disruptive past: just right distance - separated
At adaptive distance: e.g. faster BITE on more recent states
Discriminated past: application classifies, snapshot system filters:
Some snapshots outlive others, some can be accessed faster
Flexible classification: e.g. after the fact
Snapshot system operations
Request to take a snapshot (declaration): sid = snapshot_request(filter_spec)
Request to access snapshot v: snapshot_access(sid)
Request to specify a filter for snapshot v: lazy_filter(sid, filter_spec)
T1, T2, S1, T3, T4, T5, S2,…
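The three operations above can be sketched as follows. This is a minimal illustrative model, not the actual SNAP/Thor API: the class name, the eager in-memory copy, and the dictionary representation of the DB are all assumptions for exposition; the real system archives lazily with copy-on-write.

```python
import types

# Minimal sketch (assumed names, not the SNAP implementation) of the
# snapshot operation interface: declare, access, and lazily filter.
class SnapshotSystem:
    def __init__(self):
        self.next_sid = 1
        self.filters = {}      # sid -> filter_spec (may be supplied lazily)
        self.snapshots = {}    # sid -> frozen copy of DB state at declaration
        self.db = {}           # live database: page id -> contents

    def snapshot_request(self, filter_spec=None):
        """Declare a snapshot of the current state; returns its sid."""
        sid = self.next_sid
        self.next_sid += 1
        # Conceptually COW; copied eagerly here for simplicity.
        self.snapshots[sid] = dict(self.db)
        if filter_spec is not None:
            self.filters[sid] = filter_spec
        return sid

    def snapshot_access(self, sid):
        """Return a read-only view of snapshot sid, for BITE."""
        return types.MappingProxyType(self.snapshots[sid])

    def lazy_filter(self, sid, filter_spec):
        """Supply the filter spec after the fact."""
        self.filters[sid] = filter_spec
```

In the stream T1, T2, S1, T3, .. each Si returns a sid; later BITE runs see the state as of Si even though transactions keep committing.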
Baseline storage system
General interface: pages and a page table; transactions access objects on pages
Server: DB disk: slotted pages of objects,
physical oid (page#, o#), and a page table
Transaction log; cache: pages and a modified-object cache
Storage system, cont.: optimistic CC + ARIES
Clients: fetch pages, run transactions, send modified objects to server
Server: validates, commits (WAL), caches committed modifications; no-force, no-steal
The snapshot system
Archive separated from DB: archive I/O sequential, DB I/O random
Copy-on-write (COW): copy out snapshot states into the archive
just before updating the DB, during cleaning.
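A sketch of the copy-on-write step during cleaning (an assumed model, not the SNAP implementation): before a cleaned page overwrites its slot in the DB, its pre-state is appended to the sequential archive, once per snapshot span.

```python
# COW archiving during cleaning (illustrative sketch). The archive is an
# append-only log written sequentially; the DB is updated in place.
class Store:
    def __init__(self):
        self.db = {}        # page id -> contents (random-access DB disk)
        self.archive = []   # append-only: (snapshot id, page id, pre-state)
        self.snapshot_id = 0
        self.recorded = set()   # pages already archived in current span

    def declare_snapshot(self):
        self.snapshot_id += 1
        self.recorded = set()   # new span: pre-states not yet recorded

    def clean(self, page, new_contents):
        # Archive the pre-state once per span, then update in place.
        if page in self.db and page not in self.recorded:
            self.archive.append((self.snapshot_id, page, self.db[page]))
            self.recorded.add(page)
        self.db[page] = new_contents
```

Note how the second write to a page within one span archives nothing: the snapshot only needs the state as of its declaration.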
Snapshot interface
Same as the DB: snapshot pages and a snapshot page table
So BITE is transparent: BITE on snapshot S(v) uses PageTable(v)
Snapshot system, below the interface:
Some S(v) pages are in the archive,
some in the DB,
and pages in the archive can have
different representations
BITE (v): namespace redirection
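The namespace redirection can be sketched as below (illustrative; the location tags and addressing scheme are assumptions): PageTable(v) maps each page either to an archive address or back to the live DB, so the same page/page-table interface serves both BITE and normal execution.

```python
# BITE namespace redirection sketch: a snapshot page table entry says
# where snapshot v's copy of a page lives.
ARCHIVE, DB = "archive", "db"

def read_page(page_table, archive, db, page):
    where, addr = page_table[page]
    if where == ARCHIVE:
        return archive[addr]   # archived copy, possibly in a different rep
    return db[page]            # unmodified since v: still in the DB

# Example state: P was modified after v4 (archived copy at address 0);
# Q is unchanged, so v4's table redirects to the live DB.
db = {"P": "p_now", "Q": "q_now"}
archive = {0: "p_v4"}
pt_v4 = {"P": (ARCHIVE, 0), "Q": (DB, None)}
```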
Creating non-disruptive snapshots (I/O-bound system)
Archiving snapshot states when cleaning can slow down cleaning compared to a system without snapshots.
Copying to the archive disk (sequential I/O) in parallel with database I/O (random) can partially hide the archiving cost behind database I/O.
Creating snapshots: how well can you hide?
Determined by how much is archived:
compactness of the snapshot representation, snapshot frequency, update workload (overwriting),
and by the cost of archiving: sequential writes, other archive traffic (BITE)
Creating snapshots: some issues
Issue: avoid overwriting snapshot states
(without blocking, pinning, etc.)
Issue: update snapshot metadata efficiently (large, dynamic page tables)
Issue: filter out long-lived snaps (focus here)
New techniques for copy-out snapshots:
- VMOB: in-memory versioned data structure preserves snapshot states w/out blocking
- LPT: incrementally archived page table with logarithmic reconstruction cost
- Filtering: exploit smart representation for past states (focus here)
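The VMOB idea can be sketched as follows (the structure and method names are assumptions for illustration, not the Thor data structure): an in-memory versioned buffer keeps the pre-states that pending snapshots still need, so transactions keep committing new versions while the archiver drains old ones.

```python
# Sketch of a versioned modified-object buffer (VMOB-like, assumed
# structure): preserves snapshot states without blocking updaters.
class VMOB:
    def __init__(self):
        self.versions = {}   # obj -> list of (snapshot id, value)

    def commit(self, snapshot_id, obj, value):
        """A transaction installs a new version; never blocks on archiving."""
        self.versions.setdefault(obj, []).append((snapshot_id, value))

    def archive_upto(self, snapshot_id):
        """Archiver consumes versions belonging to snapshots <= snapshot_id."""
        out = []
        for obj, vs in self.versions.items():
            out += [(sid, obj, val) for sid, val in vs if sid <= snapshot_id]
            self.versions[obj] = [(sid, val) for sid, val in vs
                                  if sid > snapshot_id]
        return out
```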
Filtering: motivation
Want unlimited past at high resolution, but
some snapshots are transient, others of long-term interest to the application:
the application needs to discriminate between snapshots
Thresher: a filtering system for SNAP
Snapshot representation
What can representation do for filtering?
lifetime-based allocation: avoids fragmentation
diff-based encoding: reduces the cost of copying
adaptive combination: the real winner
Example: hierarchical snapshots at multiple time granularity
An ICU patient-monitoring DB takes snapshots:
minute by minute: vital-sign monitor readings
hourly: includes the nurse's writeup summarizing monitor readings
daily: includes the doctor's notes summarizing the nurse's checkups
Doctor's snapshots have a longer lifetime than the nurse's ...
Brief overview: snapshot creation
Some notation: snapshot span; recorded pages
example: .. v4, T: w(x_P), T': w(y_S), v5, T'' ..
Span of v4: T, T'; pages recorded by snapshot v4: P, S
Incremental snapshot creation: archived snapshot pages are dispersed:
.. v4: P S | v5: P Q | ..
Archived snapshot page tables (PT):
PT(v4): addr(P4), addr(S4); PT(v5): addr(P5), addr(Q5) ..
Another talk: how to construct archived page tables:
Construct APT(v4) = recorded(v4) + Construct APT(v5)
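The APT recurrence can be sketched directly (assumed semantics: snapshot v's copy of page p is the first copy of p archived at or after v, so v's own recordings override the next snapshot's table). The real LPT achieves logarithmic reconstruction cost; this linear recursion only illustrates the recurrence.

```python
# Sketch of archived page table (APT) construction:
# APT(v) = recorded(v) overriding APT(v+1), recursively.
def construct_apt(recorded_by, v, last):
    """recorded_by: sid -> {page: archive address}; snapshots v..last."""
    if v > last:
        return {}                       # past the archive: pages live in the DB
    apt = dict(construct_apt(recorded_by, v + 1, last))
    apt.update(recorded_by.get(v, {}))  # v's own recordings take precedence
    return apt
```

With the slide's example (v4 records P4, S4; v5 records P5, Q5), APT(v4) maps P to P4, S to S4, and Q to Q5, since v4's view of Q is the first copy archived after it.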
Filtering example: filter out short-lived v5
Doctor's v4: P S | Nurse's v5: P Q | v6 .. (archive)
Filter: retain long-lived v4, reclaim v5:
reclaim P5, retain Q5 (v4 needs it)
filtering incremental snapshots creates fragmentation
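The reclamation rule in the example can be sketched as below (assumed semantics, derived from the incremental-APT construction: a page copy recorded by a filtered-out snapshot must be retained if some retained earlier snapshot's table still points to it, i.e. no snapshot from the retained one up to the victim recorded that page itself).

```python
# Which of a filtered-out snapshot's recorded pages are reclaimable?
def reclaimable(recorded_by, victim, retained):
    """recorded_by: sid -> set of recorded pages; returns reclaimable pages."""
    keep = set()
    for page in recorded_by[victim]:
        for v in retained:
            if v < victim and all(page not in recorded_by.get(u, ())
                                  for u in range(v, victim)):
                keep.add(page)   # v's view of `page` is the copy at `victim`
                break
    return set(recorded_by[victim]) - keep
```

On the slide's example this reclaims P5 (v4 has its own P4) but retains Q5 (v4's view of Q), matching the picture above.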
Problem: fragmentation
• fragmented archive, over time: non-sequential archive writes,
or
random reads to copy out long-lived states
Our approach: filter-spec
Filter spec determines
relative snapshot lifetime
“App knows best”:
the app supplies a filter spec
the system filters
Avoid fragmentation with the filter spec:
Known at snapshot declaration:
use lifetime-based allocation
After the fact:
use a flexible rep to filter lazily;
the rep allows an adaptive trade-off:
cost of filtering vs cost of BITE
App specifies filter at declaration
long-lived pages (P4, S4, Q5) and short-lived pages (P5) go to separate archive areas
Invariant: to reclaim without fragmentation,
short-lived areas store no long-lived pages
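The invariant can be sketched with a toy allocator (illustrative only; the paper's FilterTree is far more elaborate): pages are appended to a region chosen by their declared lifetime class, so a short-lived region is reclaimed whole, with no copying of long-lived pages and no fragmentation.

```python
# Lifetime-based allocation sketch: segregate archive regions by the
# lifetime class the filter spec declares.
class LifetimeAllocator:
    def __init__(self):
        self.regions = {"short": [], "long": []}

    def archive(self, page, lifetime):
        # Invariant: classes never mix within a region.
        self.regions[lifetime].append(page)

    def reclaim_short(self):
        """Drop the whole short-lived region at once; returns pages freed."""
        freed = len(self.regions["short"])
        self.regions["short"] = []
        return freed
```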
FilterTree: filter pages for free
After-the-fact (lazy) filtering
Some applications want
to defer filter specification
Lazy filtering requires copying
We can specialize representation (compact)
to reduce copying cost
Compact representation: diffs
Two components filtered separately:
compact diffs: reduce the cost of copying (diffs clustered by page)
checkpoints: accelerate BITE (page-based snapshots, system-declared, can use the FilterTree)
Adaptive trade-off
Like recovery log:
less frequent checkpoints
increase compactness
more frequent checkpoints
accelerate BITE
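The trade-off can be sketched as below (an assumed model of the diff representation, with diffs as per-snapshot field updates): BITE rebuilds a snapshot page from the nearest earlier checkpoint plus the diffs since, so checkpoint frequency tunes BITE latency against archive compactness, much as checkpoint frequency tunes recovery-log replay.

```python
# Diff/checkpoint reconstruction sketch for BITE on snapshot v.
def rebuild_page(checkpoints, diffs, v):
    """checkpoints: {sid: page state}; diffs: {sid: field updates}.
    Returns (state as of snapshot v, number of diff versions replayed)."""
    base = max(c for c in checkpoints if c <= v)   # nearest earlier checkpoint
    state = dict(checkpoints[base])
    replayed = 0
    for sid in sorted(d for d in diffs if base < d <= v):
        state.update(diffs[sid])                   # replay diffs up to v
        replayed += 1
    return state, replayed
```

Fewer checkpoints mean a more compact archive but more diffs replayed per BITE, and vice versa.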
Lazy filtering: checkpoints filtered for free
[Figure: a FilterTree over checkpoints B1, B2, B3, ..; archive regions hold diff extents E1, E2, E3, .. grouped as G1, G2 (diffs)]
But some applications want more:
lazy filtering and faster BITE
e.g. an app runs BITE on a batch of recent snapshots to decide which ones to retain;
it needs fast BITE to keep up..
Combined hybrid
Faster BITE in recent window
and
Lazy filtering
Hybrid: checkpoints, with checkpoints filtered for free
Status
Implemented:
SNAP and Thresher for Thor storage system
Performance results: encouraging.
Here is a 5,000-foot view:
Performance metrics
Cost of filtering: non-disruptiveness = rate-of-drain / rate-of-pour
t_clean determines rate-of-drain; workload parameter: overwriting
Compactness of diff-based rep: retention relative to the page-based rep
R_diff: fixed; R_ckp: tunable by checkpoint frequency; workload parameter: density
BITE: page-based snapshots vs diff-based vs DB
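The headline metric as defined above is a simple ratio (this helper is illustrative; the parameter names are assumptions):

```python
# Non-disruptiveness = rate-of-drain / rate-of-pour: the fraction of
# cleaning throughput that survives when archiving is added; 1.0 means
# the archiving cost is fully hidden behind database I/O.
def non_disruptiveness(rate_of_drain, rate_of_pour):
    return rate_of_drain / rate_of_pour
```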
Non-disruptiveness
Storage system w/hybrid snapshots vs
w/out snapshots (Thor)
How much drop in
rate-of-drain / rate-of-pour
Experimental configuration
Workloads: extend multi-user OO7 to control
density and overwriting
System configuration: single client, medium OO7 (small DB, 185 MB); multiple clients (large DB, 140 GB)
FilterTree
Free!
Non-disruptiveness / single client: “summertime ... life is easy”
Non-disruptiveness / multi-user: “DB works harder”
Summary: non-disruptive snapshot memory
Unlimited filtered past
is cheaper than you may think.
.. A chicken in every pot..
Every storage system
can have a snapshot box on the side..
To get there:
Generalize:
ARIES / STEAL: underway
file systems: need extended interfaces
Beyond:
upgrades: have techniques
provenance: need ideas..