A chicken in every pot: a persistent snapshot memory scaled in time
Liuba Shrira and Hao Xu
Brandeis University
Storage systems: the 7-year itch
1984: rotational delay – FFS
1991: large memory - LFS
1998: cheaper disk - Elephant
2005: .. a chicken in every pot:
a snapshot box on the side..
Trends
Hardware: Disk
Cheap ($1/GB) and getting cheaper
Software industry: Forbes (12/2004) says:
the need for keeping past state is growing
Trends cont.
- A casino chases a card counter
- IT dept. chased by Sarbanes-Oxley
- Hippocratic DB audited about patient privacy preservation
Need to analyze past activity
SNAP: a snapshot system for an object storage system
Goal:
Storage system capability for
back-in-time execution (BITE):
application runs against
read-only snapshots without synchronization
analysis in retrospect
Baseline Requirements for BITE
Consistent snapshots: same (old) invariants hold
BITE of general code: after-the-fact ad-hoc analysis (vs predefined SQL access methods)
App chooses the snapshot: snapshot state meaningful to the app (vs “some time in the past”)
High time “resolution”: fine-grained past analysis (vs backup for recovery)
Over long time-scales..
Living with the past: how close?
today: too close (temporal DBs, CVFS) or too far (warehouse, e.g. Netezza)
Snapshots can be of long-term importance or transient; today: uniform treatment, apps cannot discriminate
Inherent tension: latency of access vs cost of representation (space and time); today: limited adaptation (compress or not)
Capturing past states
Two ways:
Cheap: no-overwrite update, past stays put, copy the new:
less to write, but a bloated DB, and the past inherits the same rep
Opportunistic: in-place update, past is copied out, separated:
more to write, but can write smartly, can tailor the past rep, and the DB stays clustered
Our requirements:
Non-disruptive past: just right distance - separated
At adaptive distance: e.g. faster BITE on more recent states
Discriminated past: application classifies, snapshot system filters:
Some snapshots outlive others, some can be accessed faster
Flexible classification: e.g. after the fact
Snapshot system operations
Request to take a snapshot (declaration): sid = snapshot_request(filter_spec)
Request to access snapshot v: snapshot_access(sid)
Request to specify a filter for snapshot v: lazy_filter(sid, filter_spec)
T1, T2, S1, T3, T4, T5, S2,…
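The three operations above can be sketched as follows. This is a minimal illustrative model, not the actual SNAP/Thor API: the class name, the eager in-memory copy, and the dictionary representation of the DB are all assumptions for exposition; the real system archives lazily with copy-on-write.

```python
import types

# Minimal sketch (assumed names, not the SNAP implementation) of the
# snapshot operation interface: declare, access, and lazily filter.
class SnapshotSystem:
    def __init__(self):
        self.next_sid = 1
        self.filters = {}      # sid -> filter_spec (may be supplied lazily)
        self.snapshots = {}    # sid -> frozen copy of DB state at declaration
        self.db = {}           # live database: page id -> contents

    def snapshot_request(self, filter_spec=None):
        """Declare a snapshot of the current state; returns its sid."""
        sid = self.next_sid
        self.next_sid += 1
        # Conceptually COW; copied eagerly here for simplicity.
        self.snapshots[sid] = dict(self.db)
        if filter_spec is not None:
            self.filters[sid] = filter_spec
        return sid

    def snapshot_access(self, sid):
        """Return a read-only view of snapshot sid, for BITE."""
        return types.MappingProxyType(self.snapshots[sid])

    def lazy_filter(self, sid, filter_spec):
        """Supply the filter spec after the fact."""
        self.filters[sid] = filter_spec
```

In the stream T1, T2, S1, T3, .. each Si returns a sid; later BITE runs see the state as of Si even though transactions keep committing.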
Baseline storage system
General interface: pages and a page table; transactions access objects on pages
Server: DB disk: slotted pages of objects,
physical oid (page#, o#), and a page table
Transaction log; cache: pages and a modified-object cache
Storage system, cont.: optimistic CC + ARIES
Clients: fetch pages, run transactions, send modified objects to server
Server: validates, commits (WAL), caches committed modifications; no-force, no-steal
The snapshot system
Archive separated from DB: archive I/O sequential, DB I/O random
Copy-on-write (COW): copy out snapshot states into the archive
just before updating the DB, during cleaning.
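A sketch of the copy-on-write step during cleaning (an assumed model, not the SNAP implementation): before a cleaned page overwrites its slot in the DB, its pre-state is appended to the sequential archive, once per snapshot span.

```python
# COW archiving during cleaning (illustrative sketch). The archive is an
# append-only log written sequentially; the DB is updated in place.
class Store:
    def __init__(self):
        self.db = {}        # page id -> contents (random-access DB disk)
        self.archive = []   # append-only: (snapshot id, page id, pre-state)
        self.snapshot_id = 0
        self.recorded = set()   # pages already archived in current span

    def declare_snapshot(self):
        self.snapshot_id += 1
        self.recorded = set()   # new span: pre-states not yet recorded

    def clean(self, page, new_contents):
        # Archive the pre-state once per span, then update in place.
        if page in self.db and page not in self.recorded:
            self.archive.append((self.snapshot_id, page, self.db[page]))
            self.recorded.add(page)
        self.db[page] = new_contents
```

Note how the second write to a page within one span archives nothing: the snapshot only needs the state as of its declaration.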
Snapshot interface
Same as the DB: snapshot pages and a snapshot page table
So BITE is transparent: BITE on snapshot S(v) uses PageTable(v)
Snapshot system, below the interface:
Some S(v) pages are in the archive,
some in the DB,
and pages in the archive can have
different representations
BITE (v): namespace redirection
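The namespace redirection can be sketched as below (illustrative; the location tags and addressing scheme are assumptions): PageTable(v) maps each page either to an archive address or back to the live DB, so the same page/page-table interface serves both BITE and normal execution.

```python
# BITE namespace redirection sketch: a snapshot page table entry says
# where snapshot v's copy of a page lives.
ARCHIVE, DB = "archive", "db"

def read_page(page_table, archive, db, page):
    where, addr = page_table[page]
    if where == ARCHIVE:
        return archive[addr]   # archived copy, possibly in a different rep
    return db[page]            # unmodified since v: still in the DB

# Example state: P was modified after v4 (archived copy at address 0);
# Q is unchanged, so v4's table redirects to the live DB.
db = {"P": "p_now", "Q": "q_now"}
archive = {0: "p_v4"}
pt_v4 = {"P": (ARCHIVE, 0), "Q": (DB, None)}
```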
Creating non-disruptive snapshots (I/O-bound system)
Archiving snapshot states when cleaning can slow down cleaning compared to a system without snapshots.
Copying to the archive disk (sequential I/O) in parallel with database I/O (random) can partially hide the archiving cost behind database I/O.
Creating snapshots: how well can you hide?
Determined by how much is archived:
compactness of the snapshot representation, snapshot frequency, update workload (overwriting),
and by the cost of archiving: sequential writes, other archive traffic (BITE)
Creating snapshots: some issues
Issue: avoid overwriting snapshot states
(without blocking, pinning, etc.)
Issue: update snapshot metadata efficiently (large, dynamic page tables)
Issue: filter out long-lived snaps (focus here)
New techniques for copy-out snapshots:
- VMOB: in-memory versioned data structure preserves snapshot states w/out blocking
- LPT: incrementally archived page table with logarithmic reconstruction cost
- Filtering: exploit smart representation for past states (focus here)
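The VMOB idea can be sketched as follows (the structure and method names are assumptions for illustration, not the Thor data structure): an in-memory versioned buffer keeps the pre-states that pending snapshots still need, so transactions keep committing new versions while the archiver drains old ones.

```python
# Sketch of a versioned modified-object buffer (VMOB-like, assumed
# structure): preserves snapshot states without blocking updaters.
class VMOB:
    def __init__(self):
        self.versions = {}   # obj -> list of (snapshot id, value)

    def commit(self, snapshot_id, obj, value):
        """A transaction installs a new version; never blocks on archiving."""
        self.versions.setdefault(obj, []).append((snapshot_id, value))

    def archive_upto(self, snapshot_id):
        """Archiver consumes versions belonging to snapshots <= snapshot_id."""
        out = []
        for obj, vs in self.versions.items():
            out += [(sid, obj, val) for sid, val in vs if sid <= snapshot_id]
            self.versions[obj] = [(sid, val) for sid, val in vs
                                  if sid > snapshot_id]
        return out
```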
Filtering: motivation
Want unlimited past at high resolution, but
some snapshots are transient, others of long-term interest to the application:
the application needs to discriminate between snapshots
Thresher: a filtering system for SNAP
Snapshot representation
What can representation do for filtering?
lifetime-based allocation: avoids fragmentation
diff-based encoding: reduces the cost of copying
adaptive combination: the real winner
Example: hierarchical snapshots at multiple time granularity
An ICU patient-monitoring DB takes snapshots:
minute by minute: vital-sign monitor readings
hourly: includes the nurse's writeup summarizing monitor readings
daily: includes the doctor's notes summarizing the nurse's checkups
Doctor's snapshots have a longer lifetime than the nurse's ...
Brief overview: snapshot creation
Some notation: snapshot span; recorded pages
example: .. v4, T: w(x_P), T': w(y_S), v5, T'' ..
Span of v4: T, T'; pages recorded by snapshot v4: P, S
Incremental snapshot creation: archived snapshot pages are dispersed:
.. v4: P S | v5: P Q | ..
Archived snapshot page tables (PT):
PT(v4): addr(P4), addr(S4); PT(v5): addr(P5), addr(Q5) ..
Another talk: how to construct archived page tables:
Construct APT(v4) = recorded(v4) + Construct APT(v5)
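The APT recurrence can be sketched directly (assumed semantics: snapshot v's copy of page p is the first copy of p archived at or after v, so v's own recordings override the next snapshot's table). The real LPT achieves logarithmic reconstruction cost; this linear recursion only illustrates the recurrence.

```python
# Sketch of archived page table (APT) construction:
# APT(v) = recorded(v) overriding APT(v+1), recursively.
def construct_apt(recorded_by, v, last):
    """recorded_by: sid -> {page: archive address}; snapshots v..last."""
    if v > last:
        return {}                       # past the archive: pages live in the DB
    apt = dict(construct_apt(recorded_by, v + 1, last))
    apt.update(recorded_by.get(v, {}))  # v's own recordings take precedence
    return apt
```

With the slide's example (v4 records P4, S4; v5 records P5, Q5), APT(v4) maps P to P4, S to S4, and Q to Q5, since v4's view of Q is the first copy archived after it.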
Filtering example: filter out short-lived v5
Doctor's v4: P S | Nurse's v5: P Q | v6 .. (archive)
Filter: retain long-lived v4, reclaim v5:
reclaim P5, retain Q5 (v4 needs it)
filtering incremental snapshots creates fragmentation
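The reclamation rule in the example can be sketched as below (assumed semantics, derived from the incremental-APT construction: a page copy recorded by a filtered-out snapshot must be retained if some retained earlier snapshot's table still points to it, i.e. no snapshot from the retained one up to the victim recorded that page itself).

```python
# Which of a filtered-out snapshot's recorded pages are reclaimable?
def reclaimable(recorded_by, victim, retained):
    """recorded_by: sid -> set of recorded pages; returns reclaimable pages."""
    keep = set()
    for page in recorded_by[victim]:
        for v in retained:
            if v < victim and all(page not in recorded_by.get(u, ())
                                  for u in range(v, victim)):
                keep.add(page)   # v's view of `page` is the copy at `victim`
                break
    return set(recorded_by[victim]) - keep
```

On the slide's example this reclaims P5 (v4 has its own P4) but retains Q5 (v4's view of Q), matching the picture above.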
Problem: fragmentation
• fragmented archive, over time: non-sequential archive writes,
or
random reads to copy out long-lived states
Our approach: filter-spec
Filter spec determines
relative snapshot lifetime
“App knows best”:
the app supplies a filter spec
the system filters
Avoid fragmentation with the filter spec:
Known at snapshot declaration:
use lifetime-based allocation
After the fact:
use a flexible rep to filter lazily;
the rep allows an adaptive trade-off:
cost of filtering vs cost of BITE
App specifies filter at declaration
long-lived pages (P4, S4, Q5) and short-lived pages (P5) go to separate archive areas
Invariant: to reclaim without fragmentation,
short-lived areas store no long-lived pages
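The invariant can be sketched with a toy allocator (illustrative only; the paper's FilterTree is far more elaborate): pages are appended to a region chosen by their declared lifetime class, so a short-lived region is reclaimed whole, with no copying of long-lived pages and no fragmentation.

```python
# Lifetime-based allocation sketch: segregate archive regions by the
# lifetime class the filter spec declares.
class LifetimeAllocator:
    def __init__(self):
        self.regions = {"short": [], "long": []}

    def archive(self, page, lifetime):
        # Invariant: classes never mix within a region.
        self.regions[lifetime].append(page)

    def reclaim_short(self):
        """Drop the whole short-lived region at once; returns pages freed."""
        freed = len(self.regions["short"])
        self.regions["short"] = []
        return freed
```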
FilterTree: filter pages for free
After-the-fact (lazy) filtering
Some applications want
to defer filter specification
Lazy filtering requires copying
We can specialize representation (compact)
to reduce copying cost
Compact representation: diffs
Two components filtered separately:
compact diffs: reduce the cost of copying (diffs clustered by page)
checkpoints: accelerate BITE (page-based snapshots, system-declared, can use the FilterTree)
Adaptive trade-off
Like recovery log:
less frequent checkpoints
increase compactness
more frequent checkpoints
accelerate BITE
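The trade-off can be sketched as below (an assumed model of the diff representation, with diffs as per-snapshot field updates): BITE rebuilds a snapshot page from the nearest earlier checkpoint plus the diffs since, so checkpoint frequency tunes BITE latency against archive compactness, much as checkpoint frequency tunes recovery-log replay.

```python
# Diff/checkpoint reconstruction sketch for BITE on snapshot v.
def rebuild_page(checkpoints, diffs, v):
    """checkpoints: {sid: page state}; diffs: {sid: field updates}.
    Returns (state as of snapshot v, number of diff versions replayed)."""
    base = max(c for c in checkpoints if c <= v)   # nearest earlier checkpoint
    state = dict(checkpoints[base])
    replayed = 0
    for sid in sorted(d for d in diffs if base < d <= v):
        state.update(diffs[sid])                   # replay diffs up to v
        replayed += 1
    return state, replayed
```

Fewer checkpoints mean a more compact archive but more diffs replayed per BITE, and vice versa.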
Lazy filtering: checkpoints filtered for free
[Figure: a FilterTree over checkpoints B1, B2, B3, ..; archive regions hold diff extents E1, E2, E3, .. grouped as G1, G2 (diffs)]
But some applications want more:
lazy filtering and faster BITE
e.g. an app runs BITE on a batch of recent snapshots to decide which ones to retain;
it needs fast BITE to keep up..
Combined hybrid
Faster BITE in recent window
and
Lazy filtering
Hybrid: checkpoints, with checkpoints filtered for free
Status
Implemented:
SNAP and Thresher for Thor storage system
Performance results: encouraging.
Here is a 5,000-foot view:
Performance metrics
Cost of filtering: non-disruptiveness = rate-of-drain / rate-of-pour
t_clean determines rate-of-drain; workload parameter: overwriting
Compactness of diff-based rep: retention relative to the page-based rep
R_diff: fixed; R_ckp: tunable by checkpoint frequency; workload parameter: density
BITE: page-based snapshots vs diff-based vs DB
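The headline metric as defined above is a simple ratio (this helper is illustrative; the parameter names are assumptions):

```python
# Non-disruptiveness = rate-of-drain / rate-of-pour: the fraction of
# cleaning throughput that survives when archiving is added; 1.0 means
# the archiving cost is fully hidden behind database I/O.
def non_disruptiveness(rate_of_drain, rate_of_pour):
    return rate_of_drain / rate_of_pour
```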
Non-disruptiveness
Storage system w/hybrid snapshots vs
w/out snapshots (Thor)
How much drop in
rate-of-drain / rate-of-pour
Experimental configuration
Workloads: extend multi-user OO7 to control
density and overwriting
System configuration: single client, medium OO7 (small DB, 185 MB); multiple clients (large DB, 140 GB)
FilterTree
Free!
Non-disruptiveness / single client: “summertime ... life is easy”
Non-disruptiveness / multi-user: “DB works harder”
Summary: non-disruptive snapshot memory
Unlimited filtered past
is cheaper than you may think.
.. A chicken in every pot..
Every storage system
can have a snapshot box on the side..
To get there:
Generalize:
ARIES / STEAL: underway
file systems: need extended interfaces
Beyond:
upgrades: have techniques
provenance: need ideas..