Post on 28-Dec-2015
transcript
Deconstructing Commodity Storage
Clusters
Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau (Univ. of Wisconsin-Madison)
Jiri Schindler (EMC Corporation)
2
Storage system
– Important components of large-scale systems
– Multi-billion dollar industry
Often comprised of high-end storage servers
– A big box with lots of disks inside
The simple question
– How does a storage server work?
– Simple but hard: closed storage subsystem design
3
Why need to know?
Better modeling
– How the system behaves under different workloads
– Example from the storage industry: a capacity model for capacity planning
– A model is limited if the information is limited
Product validation
– Validate what the product specs say
– Performance numbers alone cannot confirm them
Critical evaluation of design and implementation choices
– Control what is occurring inside
4
Traditionally black box
Highly customized and proprietary hardware and OS– Hitachi Lightning, NetApp Filers, EMC Symmetrix– EMC Symmetrix: disk/cache manager, proprietary OS
Internal information is hidden behind standard interfaces
[Figure: a client sends requests to the storage system and sees only acks; the system's internals are a question mark]
5
Modern graybox storage system
Cluster of commodity PCs running a commodity OS
– Google FS cluster, HP FAB, EMC Centera
Advantages of commodity storage clusters
– Direct internal observation: visible probe points
– Leverage existing standardized tools
[Figure: a client and a storage system whose internal events, e.g. database updates, are visible at probe points]
6
Intra-box Techniques
Two “intra-box” techniques
– Observation
– System perturbation
Two components of analysis
– Deduce structure of the main communication protocol
• Object Read and Write protocol
– Internal policy decisions
• Caching, prefetching, write buffering, load balancing, etc.
7
Goal and EMC Author
Objectives
– Feasibility of deconstructing commodity storage clusters with no source code
– Results achieved without EMC assistance
EMC Author
– Evaluates the correctness of our findings
– Gives insights behind their design decisions
8
Outline
Introduction
EMC Centera Overview
– Intra-box tools
Deducing Protocol
– Observation and Delay Perturbation
Inferring Policies
– System Perturbation
Conclusion
9
Centera Topology
[Figure: clients connect over the WAN to two access nodes (AN 1, AN 2), which connect over the LAN to six storage nodes (SN 1 through SN 6)]
10
[Figure: software stack. The client runs the Client SDK over TCP across the WAN. Each access node runs Centera software on Linux (TCP/UDP). Each storage node runs Centera software on a commodity Linux stack (TCP/UDP, ReiserFS, IDE driver) and communicates over the LAN]
11
Probe Points – Observation
Internal probe points
– Trace traffic using standardized tools
• tcpdump: trace network traffic
• Pseudo device driver: trace disk traffic
[Figure: tcpdump attached at the client, access node, and storage node network stacks; a pseudo device driver between ReiserFS and the IDE drives traces disk traffic at the storage node]
12
Probe Points – Perturbation
Perturbing the system at probe points
– Modified NistNet: delay particular messages
– Pseudo device driver: delay disk I/O traffic
– Additional load
• CPU load: high-priority while loop
• Disk load: file copy
[Figure: modified NistNet inserted in the client, access node, and storage node network stacks to add delay; the pseudo device driver adds delay to disk I/O; a user-level process adds CPU load (while(1) loop) and disk load (file copy) at a storage node]
13
Outline
Introduction
EMC Centera Overview
Deducing Protocol
– Observation and Delay Perturbation
Inferring Policies
– System Perturbation
Conclusion
14
Understanding the protocol
Understanding the Read/Write protocol
– Read and write implementations in large distributed storage systems are not simple
– Deconstruct the protocol structure
• Which pieces are involved?
• Where is data sent?
• Is data reliably stored, mirrored, striped?
15
Observing Write Protocol
Deconstruct the protocol using passive observation
– Run a series of write workloads
– Observe network and disk traffic
– Correlation tools: convert traces into protocol structure
[Figure: a client issues write() to EMC Centera; traffic flows through the access nodes to the storage nodes]
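The correlation step can be sketched in a few lines. This is a minimal illustration, not the authors' actual tool, and the event names, fields, and timings are invented: it merges timestamped network events (from tcpdump) and disk events (from the pseudo device driver) into one ordered view, which is the raw material for reconstructing the protocol structure.

```python
# Sketch of trace correlation (illustrative only; event names and
# timings are hypothetical, not the authors' tool or measured data).

def correlate(net_events, disk_events):
    """Merge (timestamp, location, event) tuples from both traces
    into a single sequence ordered by time."""
    merged = sorted(net_events + disk_events, key=lambda e: e[0])
    return [(loc, ev) for _, loc, ev in merged]

# Hypothetical traces for a single object write (times in ms).
net = [(0.0, "AN", "write-request"), (1.2, "SN1", "data-transfer"),
       (5.0, "SN1", "write-commit"), (6.1, "AN", "write-complete")]
disk = [(3.4, "SN1", "disk-write")]

sequence = correlate(net, disk)
# The disk write lands between the data transfer and the commit,
# hinting (subject to delay-based confirmation) that the commit is
# sent only after data reaches the disk.
```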
16
Observation Results: Object Write Protocol
Findings
– Phase 1: Write request establishment
– Phase 2: Data transfer
– Phase 3: Disk write, notify other SNs, commit
– Phase 4: Series of acknowledgements
General properties
– The primary SN handles generation of the 2nd copy
– Two new TCP connections per object write
[Figure: timeline of an object write. Client and access node: TCP setup, write request, request ack. Access node to primary SN, and primary SN to secondary SN: TCP setup, write request, request acks, data transfer, transfer ack. Disk writes occur at both SNs, followed by write-commit messages, write complete, and software ACKs]
17
Resolving Dependencies
Cannot conclude dependencies from observation only
– B after A != B depends on A
• Must delay A, and see if B is delayed
[Figure: AN, primary SN, and secondary SN exchanging the secondary commit (sc) and the primary commit (pc)]
From observation only: the primary commit appears to depend on the secondary commit and the synchronous disk write
Conclude causality by delaying: the disk write traffic, and the secondary commit
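The rule above ("delay A and see if B is delayed") reduces to a simple comparison: B's timing in a baseline run versus a run where A was artificially delayed. A toy sketch of that inference, with made-up timings and a hypothetical slack parameter:

```python
def depends_on(baseline_b, perturbed_b, injected_delay, slack=0.1):
    """Infer that B depends on A if delaying A by `injected_delay`
    pushes B's completion time back by roughly the same amount.
    `slack` tolerates measurement noise (hypothetical parameter)."""
    shift = perturbed_b - baseline_b
    return shift >= injected_delay * (1 - slack)

# Hypothetical timings (ms): the primary commit normally appears at
# t=5.0; after injecting a 50 ms delay on the secondary commit it
# appears at t=55.2, so the dependency is confirmed.
print(depends_on(5.0, 55.2, 50.0))
# A message that merely follows A without depending on it shows no
# comparable shift:
print(depends_on(5.0, 5.3, 50.0))
```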
18
Delaying a Particular Message
Need to delay a particular message– Leverage packet sizes– Modify NistNet
• Delay specific message, not link
– Ex: delay sc (90 bytes)
[Figure: packet sizes observed among client, access node, primary SN, and secondary SN (e.g. 509, 161, 289, 375, 321 bytes); the 90-byte secondary commit (sc) is identifiable by size alone. Inside the primary SN, the modified NistNet checks each incoming packet: if its size is 90 bytes it enters a delay queue, otherwise it passes straight up to CentraStar]
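The modified NistNet behaves like a size-keyed filter: packets whose size matches the targeted message (here, the 90-byte secondary commit) go to a delay queue, and everything else is forwarded untouched. A rough sketch of that classification logic (names invented; the real tool runs inside the kernel network stack):

```python
# Sketch of the size-based filtering idea behind the modified NistNet.
TARGET_SIZE = 90          # bytes: size of the secondary-commit message

def route(packet_size, forward_queue, delay_queue):
    """Hold back packets of the targeted size; forward the rest."""
    if packet_size == TARGET_SIZE:
        delay_queue.append(packet_size)
    else:
        forward_queue.append(packet_size)

fwd, delayed = [], []
for size in [509, 161, 90, 289, 375, 90, 321]:   # illustrative sizes
    route(size, fwd, delayed)
# Only the two 90-byte commits are held back; all other traffic
# flows on undisturbed, so the rest of the protocol is unperturbed.
```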
19
Delaying secondary-commit
Resolving the first dependency
– Delay the secondary commit: the primary commit also gets delayed
The primary commit depends on the receipt of the secondary commit
[Figure: AN, primary SN, and secondary SN; the injected delay on the secondary commit propagates to the primary commit]
20
Delaying disk I/O traffic
Delay disk writes at the primary storage node
[Figure: inside the primary SN, the pseudo device driver sits between ReiserFS and the IDE driver; WRITE requests go to a delay queue, other requests pass through. Delaying the disk write also delays the primary commit]
From observation and delay: the primary commit depends on the secondary commit message and the synchronous disk write
21
Ability to analyze internal designs
Intra-box techniques: observation and perturbation by delay
– Able to deduce the Object Write protocol
– Gives the ability to analyze internal design decisions
Serial vs. parallel
– Primary SN handles generation of the 2nd copy (serial) vs. AN handles both the 1st and 2nd copies (parallel)
– EMC Centera: write throughput is more important
– Decreasing load on the access nodes increases write throughput
New TCP connections (internally) per object write
– vs. using persistent connections to remove TCP setup cost
– Prefer simplicity: no need to manage persistent connections for all requests
[Figure: serial replication (Client to AN to SN1, then SN1 to SN2) vs. parallel replication (Client to AN, then AN to both SN1 and SN2)]
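The serial-vs-parallel trade-off on the access node can be made concrete with a little arithmetic. This is a simplified model (payload bytes only, two copies total, overheads ignored): under serial replication the AN sends each object once and the primary SN forwards the second copy; under parallel replication the AN sends it twice.

```python
def an_bytes_sent(object_size, n_objects, scheme):
    """Bytes leaving the access node under each replication scheme
    (simplified model: payload only, two copies total)."""
    copies_from_an = 1 if scheme == "serial" else 2
    return object_size * n_objects * copies_from_an

# 1000 writes of 1 MB objects (illustrative numbers):
serial = an_bytes_sent(1_000_000, 1000, "serial")       # 1 GB from the AN
parallel = an_bytes_sent(1_000_000, 1000, "parallel")   # 2 GB from the AN
# Serial replication halves the AN's outbound traffic, consistent with
# Centera's stated goal: offload the access nodes to raise throughput.
```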
22
Outline
Introduction
EMC Centera Overview
Deducing Protocol
Inferring Policies
– Various system perturbations
Conclusion
23
Inferring internal policies
Write policies
– Level of replication, load balancing, caching/buffering
Read policies
– Caching, prefetching, load balancing
Try to infer
– Is a particular policy implemented?
– At which level is it implemented?
• Ex: read caching at the client, access node, or storage node?
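One way to answer "at which level?" is to replay the same read and watch the probe points: if a repeated read still generates disk traffic at the storage node, nothing above the storage node's file system cached it. A toy illustration of that inference (the trace format and labels are invented, not the real probe output):

```python
def caching_level(first_read, second_read):
    """Compare probe-point hits for two identical reads.
    Each trace lists the probe points the request reached."""
    if "SN-disk" not in second_read:
        return "cached at or above the storage node"
    if second_read == first_read:
        return "no caching above the storage node's disk path"
    return "partially cached"

# Hypothetical traces: the repeated read reaches the SN's network
# stack but not its disk, consistent with caching inside the storage
# node's commodity file system rather than at the AN or client.
r1 = ["client", "AN", "SN-net", "SN-disk"]
r2 = ["client", "AN", "SN-net"]
print(caching_level(r1, r2))
```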
24
System Perturbation
Perturb the system
– Delay and extra load
Four common load-balancing factors:
– CPU load
• High-priority while loop
– Disk load
• Background file copy
– Active TCP connections
– Network delay
[Figure: the client issues write() through the access node; which storage node (SN 1, SN 2, SN 3, ...) is selected? Perturbations applied: CPU load, active TCP connections, and added network delay]
25
Write Load Balancing
What factors determine which storage nodes are selected?
Experiment:
– Observe which primary storage nodes are selected
– Without load: writes are balanced
– With load: writes skew toward unloaded nodes
[Figure: the AN chooses between sn#1 and sn#2; both unloaded at first, then sn#2 loaded]
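The load-balancing experiment reduces to counting which storage node is chosen as primary across many writes, with and without load. A minimal sketch of that tally; the selection data below is synthetic for illustration, not the measured results:

```python
from collections import Counter

def selection_share(choices):
    """Fraction of writes for which each storage node was primary."""
    counts = Counter(choices)
    total = len(choices)
    return {sn: counts[sn] / total for sn in counts}

# Synthetic runs: balanced when unloaded, skewed away from the
# loaded node when extra CPU load is placed on sn2.
unloaded = ["sn1", "sn2"] * 50
loaded = ["sn1"] * 80 + ["sn2"] * 20

print(selection_share(unloaded))   # roughly even split
print(selection_share(loaded))     # skewed toward the unloaded node
```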
26
Write Load Balancing Results
[Figure: bar chart of primary-SN selection between sn#1 and sn#2 under five conditions: no perturbation, additional CPU load, disk load, network load (active TCP), and incoming network delay]
27
Summary of findings
Write Policies
– Replication: two copies on two nodes attached to different power sources (reliability)
– Load balancing: based on CPU usage (locally observable status); network status is not incorporated
– Write buffering: storage nodes write synchronously
Read Policies
– Caching: storage node only (commodity file system); access node and client do not cache
– Prefetching: storage node only (commodity file system); access node and client do not prefetch
– Load balancing: not implemented in this earlier version; still reads from busy nodes
EMC Centera: simplicity and reliability
28
Conclusion
Intra-box:
– Observe and perturb
– Deconstruct the protocol and infer policies
– No access to source code
Power of probe points
– More places to observe
– Ability to control the system
Build systems with more externally visible probe points
– Such systems are more readily understood, analyzed, and debugged
– Higher-performing, more robust, and more reliable computer systems
29
Questions?