Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University
Jan 2004 ROC Retreat - Lake Tahoe, CA © 2004 Andy Huang
Persistent hash tables
[Figure: frontends and app servers connected over a LAN to a database]
Hash table:
Key              Value
Yahoo! user ID   User profile
ISBN             Amazon catalog metadata
Two state management challenges

Failure handling
• Consistency requirements, costly node recovery, reliable failure detection
• Relax internal consistency → fast, non-intrusive ("free") recovery

System evolution
• Large data sets, costly repartitioning, good resource provisioning
• Free recovery → automatic, online repartitioning
DStore: an easy-to-manage, cluster-based persistent hash table for Internet services
DStore architecture

[Figure: app servers, each running Dlib, connected over a LAN to a cluster of bricks]
• Dlib: exposes the hash table API and acts as the "coordinator" for distributed operations
• Brick: stores data by writing synchronously to disk
Focusing on recovery

Technique 1: Quorums (tolerant to brick inconsistency)
• Write: send to all, wait for a majority
• Read: read from a majority
• OK if some bricks' data differs; a failure just means missing some writes

Technique 2: Single-phase writes (no request relies on specific bricks)
• 2PC: a failure between phases complicates the protocol; the 2nd phase depends on a particular set of bricks and relies on reliable failure detection
• Single-phase quorum writes can be completed by any majority of bricks
• Result: simple, non-intrusive recovery; any brick can fail at any time
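The quorum protocol above can be sketched as follows. This is a minimal single-process sketch, not DStore's actual code: the Brick and Dlib classes, the counter-based timestamps, and the dict-backed storage are all illustrative simplifications.

```python
import itertools

class Brick:
    """A replica; a real brick writes synchronously to disk."""
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, ts, value):
        # Keep only the newest version of each key.
        cur = self.data.get(key)
        if cur is None or ts > cur[0]:
            self.data[key] = (ts, value)

    def get(self, key):
        return self.data.get(key)

class Dlib:
    """Coordinator: exposes the hash table API over a replica group."""
    _clock = itertools.count(1)  # stand-in for real timestamps

    def __init__(self, bricks):
        self.bricks = bricks
        self.majority = len(bricks) // 2 + 1

    def put(self, key, value):
        ts = next(Dlib._clock)
        acks = 0
        for b in self.bricks:          # send to all...
            try:
                b.put(key, ts, value)
                acks += 1
            except IOError:            # a down brick simply misses the write
                pass
        return acks >= self.majority   # ...succeed once a majority has acked

    def get(self, key):
        # Read from a majority; the highest timestamp wins.
        replies = [b.get(key) for b in self.bricks[:self.majority]]
        replies = [r for r in replies if r is not None]
        return max(replies)[1] if replies else None
```

Because reads take the highest timestamp among a majority, a brick that rejoins with stale data never needs special-case recovery code, which is the point of "free" recovery.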
Considering consistency

[Figure: Dl1's write(1) to bricks B1-B3 (initially x = 0) stops partway; Dl2's subsequent reads see both the old and new values]
• A Dlib failure can cause a partial write, violating the quorum property
• If timestamps differ, read-repair restores the majority invariant (a delayed commit)
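Read-repair as described here can be sketched like this. It is a hypothetical sketch, with each brick modeled as a plain dict mapping a key to a (timestamp, value) pair; the function name is illustrative.

```python
def read_with_repair(bricks, key, majority):
    """Quorum read: if replicas disagree, push the newest (timestamp, value)
    back to the stale ones so a majority again holds the latest write."""
    replies = [(b, b[key]) for b in bricks[:majority] if key in b]
    if not replies:
        return None
    newest = max(r for _, r in replies)   # highest timestamp wins
    for b, r in replies:
        if r < newest:
            b[key] = newest               # repair the stale replica
    return newest[1]                      # return the value itself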
Considering consistency

[Figure: Dl1's write(1) to bricks B1-B3 leaves a partial write; Dl2 detects and completes it on the next read]
• A write-in-progress cookie can be used to detect partial writes and commit/abort on the next read
• An individual client's view of DStore is consistent with that of a single centralized server (as in Bayou)
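One way to realize the write-in-progress cookie is sketched below. This is hypothetical code, not DStore's implementation: the cookie encoding, the dict-backed bricks, and the commit-on-read policy (always committing the pending write) are assumptions for illustration.

```python
def cookie_key(key):
    # A per-key marker stored alongside the data (hypothetical encoding).
    return ("cookie", key)

def put(bricks, key, ts, value):
    for b in bricks:
        b[cookie_key(key)] = (ts, value)   # 1. announce the write-in-progress
    for b in bricks:
        b[key] = (ts, value)               # 2. apply it (a Dlib crash here
                                           #    leaves a detectable partial write)
    for b in bricks:
        b.pop(cookie_key(key), None)       # 3. clear the cookie on completion

def get(bricks, key, majority):
    probed = bricks[:majority]
    pending = [b[cookie_key(key)] for b in probed if cookie_key(key) in b]
    if pending:                            # partial write detected:
        newest = max(pending)
        for b in bricks:                   # commit it now, on this read
            b[key] = newest
            b.pop(cookie_key(key), None)
        return newest[1]
    replies = [b[key] for b in probed if key in b]
    return max(replies)[1] if replies else None
```

The reader, not the failed writer, finishes the commit, so no recovery protocol has to run when a Dlib dies mid-write.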
Benchmark: Free recovery

[Figure: GET throughput, PUT throughput, and repairs/sec over 30 minutes; a brick is killed and then recovers. Panels: worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate)]

Recovery is fast and non-intrusive.
Benchmark: Automatic failure detection

[Figure: GET throughput, PUT throughput, and repairs/sec over 15 minutes with an injected fail-stutter fault. Panels: modest policy (anomaly threshold = 8) and aggressive policy (anomaly threshold = 5)]

• False positives: low cost
• Fail-stutter: detected by Pinpoint
Online repartitioning

1. Take brick offline
2. Copy data to new brick
3. Bring both bricks online

[Figure: a brick's binary partition ID is extended by one bit, splitting its key range between the old and new bricks]

Appears as if the brick just failed and recovered.
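The split step can be sketched as follows. This is hypothetical code: each brick is a plain dict, and a key's partition is taken from bits of its hash, as the binary partition IDs in the figure suggest.

```python
def repartition(brick, depth):
    """Split one brick's key range in two: keys whose next hash bit is 0
    stay on the old partition, the rest move to the new brick.
    To callers this looks like the brick failed and then recovered."""
    old, new = {}, {}
    for key, value in brick.items():
        bit = (hash(key) >> depth) & 1    # next bit of the key's hash
        (new if bit else old)[key] = value
    return old, new                       # bring both bricks online
```

Because clients already tolerate a brick being briefly absent (free recovery), the offline-copy-online sequence needs no extra coordination machinery.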
Benchmark: Automatic online repartitioning

[Figure: GET throughput, PUT throughput, and repairs/sec during repartitioning, with naive brick selection shown for comparison. Panels: hotspot in the 01 partition (6 to 12 bricks) and evenly distributed load (3 to 6 bricks)]

• Brick selection: effective
• Repartitioning: non-intrusive
Next up for free recovery

• Perform online checkpoints: take the checkpointing brick offline, just like failure + recovery
• See if free recovery can simplify online data reconstruction after hard failures
• Any other state management challenges you can think of?

[Figure: PUT throughput over 10 minutes]
Summary

Free recovery
• Quorums [spatial decoupling]. Cost: extra overprovisioning. Gain: fast, non-intrusive recovery
• Single-phase ops [temporal decoupling]. Cost: temporarily violates the "majority" invariant. Gain: any brick can fail at any time

DStore = Decoupled Storage, managed like a stateless Web farm
• Failure handling: fast, non-intrusive. Mechanism: simple reboot. Policy: aggressively reboot anomalous bricks
• System evolution: "plug-and-play". Mechanism: automatic, online repartitioning. Policy: dynamically add and remove nodes based on predicted load
DStore: an easy-to-manage, cluster-based persistent hash table for Internet services

[email protected]