Post on 28-Dec-2015
transcript
Deconstructing Commodity Storage
Clusters
Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau (Univ. of Wisconsin-Madison)
Jiri Schindler (EMC Corporation)
2
Storage system
– Important components of large-scale systems
– Multi-billion dollar industry
Often comprised of high-end storage servers
– A big box with lots of disks inside
The simple question
– How does a storage server work?
– Simple but hard: closed storage subsystem design
3
Why need to know?
Better modeling
– How the system behaves under different workloads
– Example from the storage industry: a capacity model for capacity planning
– A model is limited if the information is limited
Product validation
– Validate what the product specs say
– Performance numbers alone cannot confirm them
Critical evaluation of design and implementation choices
– Control what is occurring inside
4
Traditionally black box
Highly customized and proprietary hardware and OS– Hitachi Lightning, NetApp Filers, EMC Symmetrix– EMC Symmetrix: disk/cache manager, proprietary OS
Internal information is hidden behind standard interfaces
[Figure: a client sends requests to the storage system and sees only acks; the system's internals are a question mark]
5
Modern graybox storage system
Cluster of commodity PCs running a commodity OS
– Google FS cluster, HP FAB, EMC Centera
Advantages of commodity storage clusters
– Direct internal observation: visible probe points
– Leverage existing standardized tools
[Figure: a client and a storage system whose internal events, e.g. database updates, are visible at probe points]
6
Intra-box Techniques
Two “intra-box” techniques
– Observation
– System perturbation
Two components of analysis
– Deduce structure of the main communication protocol
• Object Read and Write protocol
– Internal policy decisions
• Caching, prefetching, write buffering, load balancing, etc.
7
Goal and EMC Author
Objectives
– Feasibility of deconstructing commodity storage clusters with no source code
– Results achieved without EMC assistance
EMC Author
– Evaluates the correctness of our findings
– Gives insights behind their design decisions
8
Outline
Introduction
EMC Centera Overview
– Intra-box tools
Deducing Protocol
– Observation and Delay Perturbation
Inferring Policies
– System Perturbation
Conclusion
9
Centera Topology
[Figure: clients connect over the WAN to two access nodes (AN 1, AN 2), which connect over the LAN to six storage nodes (SN 1 through SN 6)]
10
[Figure: software stack. The client runs the Client SDK over TCP across the WAN. Each access node runs Centera software on Linux (TCP/UDP). Each storage node runs Centera software on a commodity Linux stack (TCP/UDP, ReiserFS, IDE driver) and communicates over the LAN]
11
Probe Points – Observation
Internal probe points
– Trace traffic using standardized tools
• tcpdump: trace network traffic
• Pseudo device driver: trace disk traffic
[Figure: tcpdump attached at the client, access node, and storage node network stacks; a pseudo device driver between ReiserFS and the IDE drives traces disk traffic at the storage node]
12
Probe Points – Perturbation
Perturbing the system at probe points
– Modified NistNet: delay particular messages
– Pseudo device driver: delay disk I/O traffic
– Additional load
• CPU load: high-priority while loop
• Disk load: file copy
[Figure: modified NistNet inserted in the client, access node, and storage node network stacks to add delay; the pseudo device driver adds delay to disk I/O; a user-level process adds CPU load (while(1) loop) and disk load (file copy) at a storage node]
13
Outline
Introduction
EMC Centera Overview
Deducing Protocol
– Observation and Delay Perturbation
Inferring Policies
– System Perturbation
Conclusion
14
Understanding the protocol
Understanding the Read/Write protocol
– Read and write implementations in large distributed storage systems are not simple
– Deconstruct the protocol structure
• Which pieces are involved?
• Where is data sent?
• Is data reliably stored, mirrored, striped?
15
Observing Write Protocol
Deconstruct the protocol using passive observation
– Run a series of write workloads
– Observe network and disk traffic
– Correlation tools: convert traces into protocol structure
[Figure: a client issues write() to EMC Centera; traffic flows through the access nodes to the storage nodes]
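The correlation step can be sketched in a few lines. This is a minimal illustration, not the authors' actual tool, and the event names, fields, and timings are invented: it merges timestamped network events (from tcpdump) and disk events (from the pseudo device driver) into one ordered view, which is the raw material for reconstructing the protocol structure.

```python
# Sketch of trace correlation (illustrative only; event names and
# timings are hypothetical, not the authors' tool or measured data).

def correlate(net_events, disk_events):
    """Merge (timestamp, location, event) tuples from both traces
    into a single sequence ordered by time."""
    merged = sorted(net_events + disk_events, key=lambda e: e[0])
    return [(loc, ev) for _, loc, ev in merged]

# Hypothetical traces for a single object write (times in ms).
net = [(0.0, "AN", "write-request"), (1.2, "SN1", "data-transfer"),
       (5.0, "SN1", "write-commit"), (6.1, "AN", "write-complete")]
disk = [(3.4, "SN1", "disk-write")]

sequence = correlate(net, disk)
# The disk write lands between the data transfer and the commit,
# hinting (subject to delay-based confirmation) that the commit is
# sent only after data reaches the disk.
```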
16
Observation Results: Object Write Protocol
Findings
– Phase 1: Write request establishment
– Phase 2: Data transfer
– Phase 3: Disk write, notify other SNs, commit
– Phase 4: Series of acknowledgements
General properties
– The primary SN handles generation of the 2nd copy
– Two new TCP connections per object write
[Figure: timeline of an object write. Client and access node: TCP setup, write request, request ack. Access node to primary SN, and primary SN to secondary SN: TCP setup, write request, request acks, data transfer, transfer ack. Disk writes occur at both SNs, followed by write-commit messages, write complete, and software ACKs]
17
Resolving Dependencies
Cannot conclude dependencies from observation only
– B after A != B depends on A
• Must delay A, and see if B is delayed
[Figure: AN, primary SN, and secondary SN exchanging the secondary commit (sc) and the primary commit (pc)]
From observation only: the primary commit appears to depend on the secondary commit and the synchronous disk write
Conclude causality by delaying: the disk write traffic, and the secondary commit
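The rule above ("delay A and see if B is delayed") reduces to a simple comparison: B's timing in a baseline run versus a run where A was artificially delayed. A toy sketch of that inference, with made-up timings and a hypothetical slack parameter:

```python
def depends_on(baseline_b, perturbed_b, injected_delay, slack=0.1):
    """Infer that B depends on A if delaying A by `injected_delay`
    pushes B's completion time back by roughly the same amount.
    `slack` tolerates measurement noise (hypothetical parameter)."""
    shift = perturbed_b - baseline_b
    return shift >= injected_delay * (1 - slack)

# Hypothetical timings (ms): the primary commit normally appears at
# t=5.0; after injecting a 50 ms delay on the secondary commit it
# appears at t=55.2, so the dependency is confirmed.
print(depends_on(5.0, 55.2, 50.0))
# A message that merely follows A without depending on it shows no
# comparable shift:
print(depends_on(5.0, 5.3, 50.0))
```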
18
Delaying a Particular Message
Need to delay a particular message– Leverage packet sizes– Modify NistNet
• Delay specific message, not link
– Ex: delay sc (90 bytes)
[Figure: packet sizes observed among client, access node, primary SN, and secondary SN (e.g. 509, 161, 289, 375, 321 bytes); the 90-byte secondary commit (sc) is identifiable by size alone. Inside the primary SN, the modified NistNet checks each incoming packet: if its size is 90 bytes it enters a delay queue, otherwise it passes straight up to CentraStar]
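The modified NistNet behaves like a size-keyed filter: packets whose size matches the targeted message (here, the 90-byte secondary commit) go to a delay queue, and everything else is forwarded untouched. A rough sketch of that classification logic (names invented; the real tool runs inside the kernel network stack):

```python
# Sketch of the size-based filtering idea behind the modified NistNet.
TARGET_SIZE = 90          # bytes: size of the secondary-commit message

def route(packet_size, forward_queue, delay_queue):
    """Hold back packets of the targeted size; forward the rest."""
    if packet_size == TARGET_SIZE:
        delay_queue.append(packet_size)
    else:
        forward_queue.append(packet_size)

fwd, delayed = [], []
for size in [509, 161, 90, 289, 375, 90, 321]:   # illustrative sizes
    route(size, fwd, delayed)
# Only the two 90-byte commits are held back; all other traffic
# flows on undisturbed, so the rest of the protocol is unperturbed.
```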
19
Delaying secondary-commit
Resolving the first dependency
– Delay the secondary commit: the primary commit also gets delayed
The primary commit depends on the receipt of the secondary commit
[Figure: AN, primary SN, and secondary SN; the injected delay on the secondary commit propagates to the primary commit]
20
Delaying disk I/O traffic
Delay disk writes at the primary storage node
[Figure: inside the primary SN, the pseudo device driver sits between ReiserFS and the IDE driver; WRITE requests go to a delay queue, other requests pass through. Delaying the disk write also delays the primary commit]
From observation and delay: the primary commit depends on the secondary commit message and the synchronous disk write
21
Ability to analyze internal designs
Intra-box techniques: observation and perturbation by delay
– Able to deduce the Object Write protocol
– Gives the ability to analyze internal design decisions
Serial vs. parallel
– Primary SN handles generation of the 2nd copy (serial) vs. AN handles both the 1st and 2nd copies (parallel)
– EMC Centera: write throughput is more important
– Decreasing load on the access nodes increases write throughput
New TCP connections (internally) per object write
– vs. using persistent connections to remove TCP setup cost
– Prefer simplicity: no need to manage persistent connections for all requests
[Figure: serial replication (Client to AN to SN1, then SN1 to SN2) vs. parallel replication (Client to AN, then AN to both SN1 and SN2)]
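The serial-vs-parallel trade-off on the access node can be made concrete with a little arithmetic. This is a simplified model (payload bytes only, two copies total, overheads ignored): under serial replication the AN sends each object once and the primary SN forwards the second copy; under parallel replication the AN sends it twice.

```python
def an_bytes_sent(object_size, n_objects, scheme):
    """Bytes leaving the access node under each replication scheme
    (simplified model: payload only, two copies total)."""
    copies_from_an = 1 if scheme == "serial" else 2
    return object_size * n_objects * copies_from_an

# 1000 writes of 1 MB objects (illustrative numbers):
serial = an_bytes_sent(1_000_000, 1000, "serial")       # 1 GB from the AN
parallel = an_bytes_sent(1_000_000, 1000, "parallel")   # 2 GB from the AN
# Serial replication halves the AN's outbound traffic, consistent with
# Centera's stated goal: offload the access nodes to raise throughput.
```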
22
Outline
Introduction
EMC Centera Overview
Deducing Protocol
Inferring Policies
– Various system perturbations
Conclusion
23
Inferring internal policies
Write policies
– Level of replication, load balancing, caching/buffering
Read policies
– Caching, prefetching, load balancing
Try to infer
– Is a particular policy implemented?
– At which level is it implemented?
• Ex: read caching at the client, access node, or storage node?
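One way to answer "at which level?" is to replay the same read and watch the probe points: if a repeated read still generates disk traffic at the storage node, nothing above the storage node's file system cached it. A toy illustration of that inference (the trace format and labels are invented, not the real probe output):

```python
def caching_level(first_read, second_read):
    """Compare probe-point hits for two identical reads.
    Each trace lists the probe points the request reached."""
    if "SN-disk" not in second_read:
        return "cached at or above the storage node"
    if second_read == first_read:
        return "no caching above the storage node's disk path"
    return "partially cached"

# Hypothetical traces: the repeated read reaches the SN's network
# stack but not its disk, consistent with caching inside the storage
# node's commodity file system rather than at the AN or client.
r1 = ["client", "AN", "SN-net", "SN-disk"]
r2 = ["client", "AN", "SN-net"]
print(caching_level(r1, r2))
```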
24
System Perturbation
Perturb the system
– Delay and extra load
Four common load-balancing factors:
– CPU load
• High-priority while loop
– Disk load
• Background file copy
– Active TCP connections
– Network delay
[Figure: the client issues write() through the access node; which storage node (SN 1, SN 2, SN 3, ...) is selected? Perturbations applied: CPU load, active TCP connections, and added network delay]
25
Write Load Balancing
What factors determine which storage nodes are selected?
Experiment:
– Observe which primary storage nodes are selected
– Without load: writes are balanced
– With load: writes skew toward unloaded nodes
[Figure: the AN chooses between sn#1 and sn#2; both unloaded at first, then sn#2 loaded]
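The load-balancing experiment reduces to counting which storage node is chosen as primary across many writes, with and without load. A minimal sketch of that tally; the selection data below is synthetic for illustration, not the measured results:

```python
from collections import Counter

def selection_share(choices):
    """Fraction of writes for which each storage node was primary."""
    counts = Counter(choices)
    total = len(choices)
    return {sn: counts[sn] / total for sn in counts}

# Synthetic runs: balanced when unloaded, skewed away from the
# loaded node when extra CPU load is placed on sn2.
unloaded = ["sn1", "sn2"] * 50
loaded = ["sn1"] * 80 + ["sn2"] * 20

print(selection_share(unloaded))   # roughly even split
print(selection_share(loaded))     # skewed toward the unloaded node
```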
26
Write Load Balancing Results
[Figure: bar chart of primary-SN selection between sn#1 and sn#2 under five conditions: no perturbation, additional CPU load, disk load, network load (active TCP), and incoming network delay]
27
Summary of findings
Write Policies
– Replication: two copies on two nodes attached to different power sources (reliability)
– Load balancing: based on CPU usage (locally observable status); network status is not incorporated
– Write buffering: storage nodes write synchronously
Read Policies
– Caching: storage node only (commodity file system); access node and client do not cache
– Prefetching: storage node only (commodity file system); access node and client do not prefetch
– Load balancing: not implemented in this earlier version; still reads from busy nodes
EMC Centera: simplicity and reliability
28
Conclusion
Intra-box:
– Observe and perturb
– Deconstruct the protocol and infer policies
– No access to source code
Power of probe points
– More places to observe
– Ability to control the system
Build systems with more externally visible probe points
– Such systems are more readily understood, analyzed, and debugged
– Higher-performing, more robust, and more reliable computer systems
29
Questions?