Date post: 28-Dec-2015
Deconstructing Commodity Storage Clusters. Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau (Univ. of Wisconsin - Madison); Jiri Schindler (EMC Corporation)
Transcript
Page 1: Deconstructing Commodity Storage Clusters Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Univ. of Wisconsin - Madison.

Deconstructing Commodity Storage Clusters

Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Univ. of Wisconsin - Madison

Jiri Schindler
EMC Corporation

Page 2

Storage system

Storage system
– Important components of large-scale systems
– Multi-billion dollar industry

Often comprised of high-end storage servers
– A big box with lots of disks inside

The simple question
– How does a storage server work?
– Simple but hard: closed storage subsystem design

Page 3

Why do we need to know?

Better modeling
– How the system behaves under different workloads
– Example in the storage industry: capacity model for capacity planning
– A model is limited if the information is limited

Product validation
– Validate what product specs say
– Performance numbers alone cannot confirm them

Critical evaluation of design and implementation choices
– Control what is occurring inside

Page 4

Traditionally a black box

Highly customized and proprietary hardware and OS
– Hitachi Lightning, NetApp Filers, EMC Symmetrix
– EMC Symmetrix: disk/cache manager, proprietary OS

Internal information is hidden behind standard interfaces

[Figure: Client and Storage System. The client sees only acks; what happens inside the storage system is unknown ("?").]

Page 5

Modern graybox storage system

Cluster of commodity PCs running a commodity OS
– Google FS cluster, HP FAB, EMC Centera

Advantages of commodity storage clusters
– Direct internal observation: visible probe points
– Leverage existing standardized tools

[Figure: Client and Storage System, with "Update DB" messages visible inside the storage system.]

Page 6

Intra-box Techniques

Two “Intra-box” techniques
– Observation
– System perturbation

Two components of analysis
– Deduce structure of main communication protocol
  • Object Read and Write protocol
– Internal policy decisions
  • Caching, prefetching, write buffering, load balancing, etc.

Page 7

Goal and EMC Author

Objectives
– Feasibility of deconstructing commodity storage clusters with no source code
– Results achieved without EMC assistance

EMC Author
– Evaluate correctness of our findings
– Give insights behind their design decisions

Page 8

Outline

Introduction

EMC Centera Overview
– Intra-box tools

Deducing Protocol
– Observation and Delay Perturbation

Inferring Policies
– System Perturbation

Conclusion

Page 9

Centera Topology

[Figure: Clients reach the Access Nodes (AN 1, AN 2) over a WAN; the Access Nodes reach the Storage Nodes (SN 1 to SN 6) over a LAN.]

Page 10

[Figure: Software stack. Client: Client SDK over TCP. Access Node and Storage Node: Centera Software on a commodity OS (Linux) with Reiserfs, IDE driver, and TCP/UDP. Client and Access Node are linked by the WAN; Access Node and Storage Node by the LAN.]

Page 11

Probe Points – Observation

Internal probe points
– Trace traffic using standardized tools
  • tcpdump: trace network traffic
  • Pseudo Device Driver: trace disk traffic

[Figure: Client (Client SDK, TCP), Access Node (Centera SW, TCP/UDP), and Storage Node (Centera Software, TCP/UDP, Reiserfs, IDE drives), annotated with three tcpdump probes and one Pseudo Dev. Driver probe.]

Page 12

Probe Points – Perturbation

Perturbing the system at probe points
– Modified NistNet: delay particular messages
– Pseudo Dev. Driver: delay disk I/O traffic
– Additional load
  • CPU load: high-priority while loop
  • Disk load: file copy

[Figure: Client (Client SDK, TCP), Access Node (Centera SW, TCP/UDP), and Storage Node (Centera Software, TCP/UDP, Reiserfs, IDE drives), annotated with three tcpdump probes, three Mod. NistNet instances, a Pseudo Dev. Driver with "+ Delay", and a user-level process that adds CPU load (while(1) {..}) and disk load (cp fX fY).]
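
The two load generators named on this slide (a busy while loop and a file copy) can be approximated from user space. The following is a minimal sketch, not the authors' tooling: the helper names `cpu_load`, `disk_load`, and `demo` are hypothetical, the loops are bounded so they terminate, and nothing here raises scheduling priority, which the slide's "high priority" loop would additionally require.

```python
import os
import shutil
import tempfile
import time


def cpu_load(seconds):
    """Spin in a busy loop for roughly `seconds`; return iterations completed."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def disk_load(src_path, dst_path, copies):
    """Copy src_path over dst_path `copies` times; return total bytes written."""
    written = 0
    for _ in range(copies):
        shutil.copyfile(src_path, dst_path)
        written += os.path.getsize(dst_path)
    return written


def demo():
    """Run both generators briefly against a temporary 1 MiB file."""
    with tempfile.TemporaryDirectory() as tmp:
        f_x = os.path.join(tmp, "fX")
        f_y = os.path.join(tmp, "fY")
        with open(f_x, "wb") as handle:
            handle.write(b"\0" * (1 << 20))
        return cpu_load(0.05), disk_load(f_x, f_y, copies=3)
```

Running such generators on one storage node while leaving the others idle gives the loaded and unloaded conditions that the perturbation experiments compare.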

Page 13

Outline

Introduction

EMC Centera Overview

Deducing Protocol
– Observation and Delay Perturbation

Inferring Policies
– System Perturbation

Conclusion

Page 14

Understanding the protocol

Understanding the Read/Write protocol
– Read and Write implementations in big distributed storage systems are not simple
– Deconstruct the protocol structure
  • Which pieces are involved?
  • Where is data sent?
  • Is data reliably stored, mirrored, striped?

Page 15

Observing Write Protocol

Deconstruct the protocol using passive observation
– Run a series of write workloads
– Observe network and disk traffic
– Correlation tools: convert traces into protocol structure

[Figure: Client issues write( ) to EMC Centera (Access Nodes, Storage Nodes).]
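
The slides do not show the correlation tools themselves. As a rough sketch of the idea, the code below merges per-probe traces by timestamp and reduces them to an ordered list of message steps. The `Event` record layout, the probe names, and all sample values are assumptions made for illustration, and the sketch presumes the probes' clocks are synchronized.

```python
import heapq
from collections import namedtuple

# Hypothetical trace record: capture time, probe name, endpoints, payload size.
Event = namedtuple("Event", "time probe src dst size")


def merge_traces(*traces):
    """Merge per-probe traces (each already sorted by time) into one timeline."""
    return list(heapq.merge(*traces, key=lambda event: event.time))


def protocol_steps(timeline):
    """Reduce a timeline to ordered (src, dst, size) steps.

    A message captured at both endpoints appears twice in a row, so an event
    identical to the previous step is dropped. This also hides a genuine
    back-to-back repeat, which a real tool would have to disambiguate.
    """
    steps = []
    for event in timeline:
        step = (event.src, event.dst, event.size)
        if not steps or steps[-1] != step:
            steps.append(step)
    return steps


# Illustrative records only; not taken from a real Centera trace.
an_trace = [
    Event(0.000, "tcpdump@AN", "client", "AN", 300),
    Event(0.002, "tcpdump@AN", "AN", "SN1", 500),
]
sn_trace = [
    Event(0.003, "tcpdump@SN1", "AN", "SN1", 500),
    Event(0.004, "pseudo-dev@SN1", "SN1", "disk", 4096),
]
steps = protocol_steps(merge_traces(an_trace, sn_trace))
```

Repeating this over a series of writes and keeping the steps that recur is one way to separate the protocol's structure from incidental traffic.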

Page 16

Observation Results

Object Write Protocol findings
– Phase 1: Write request establishment
– Phase 2: Data transfer
– Phase 3: Disk write, notify other SNs, commit
– Phase 4: Series of acknowledgements

Determine general properties
– Primary SN handles generation of 2nd copy
– Two new TCP connections per object write

[Figure: Time-sequence diagram among Client, Access Node, Primary SN, and Secondary SN, with further storage nodes SNv, SNw, SNx, SNy. Labeled messages: TCP Setup, Write Req., Request Ack., Data Transfer, Transfer Ack., Write-Commit, Software ACKs, Write Complete.]

Page 17

Resolving Dependencies

Cannot conclude dependencies from observation only
– B after A != B depends on A
  • Must delay A, and see if B is delayed

From observation only: primary commit depends on secondary commit and sync. disk write

Conclude causality by delaying:
- disk write traffic and
- secondary commit

[Figure: AN, Primary SN, and Secondary SN exchanging Secondary Commit (sc) and Primary commit (pc).]
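
The "delay A, and see if B is delayed" test can be written as a comparison between an undisturbed run and a perturbed run. This is a minimal sketch under assumptions: the function name `depends_on`, the tolerance, the message name `transfer_ack`, and every timing value are illustrative and not measurements from the talk.

```python
def depends_on(baseline, perturbed, message, injected_delay, tolerance=0.2):
    """Return True if `message` slipped by roughly the injected delay.

    `baseline` and `perturbed` map message names to arrival times in seconds,
    measured from the start of the same write in an undisturbed run and in a
    run where some earlier message A was held back by `injected_delay`.
    """
    shift = perturbed[message] - baseline[message]
    return shift >= injected_delay * (1.0 - tolerance)


# Illustrative timings only; not measurements from the talk.
baseline = {"transfer_ack": 0.008, "sc": 0.010, "pc": 0.012}
delayed_sc = {"transfer_ack": 0.008, "sc": 0.510, "pc": 0.513}

pc_waits_for_sc = depends_on(baseline, delayed_sc, "pc", injected_delay=0.5)
ack_waits_for_sc = depends_on(baseline, delayed_sc, "transfer_ack", injected_delay=0.5)
```

With these sample numbers the primary commit moves with the delayed secondary commit while the transfer acknowledgement does not, which is the distinction that ordering alone cannot show.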

Page 18

Delaying a Particular Message

Need to delay a particular message
– Leverage packet sizes
– Modify NistNet
  • Delay a specific message, not the link
– Ex: delay sc (90 bytes)

[Figure: Message sizes in bytes on the paths among Client, Access Node, Primary SN, and Secondary SN (values shown: 299, 509, 161, 289, 375, 321, 539, 4, and 90 for sc, next to prim. commit). On the Primary SN (CentraStar over Linux TCP/UDP), Mod. NistNet checks each incoming packet: if size = 90 it goes to the delay queue (yes); otherwise it passes straight through (no).]
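
The size-based filter in the figure can be modeled in a few lines. This sketch only simulates the decision (match on size, hold in a delay queue, otherwise pass through); it is not NistNet code, and the function name `deliver` and the sample packet list are assumptions.

```python
def deliver(packets, match_size, delay):
    """Simulate a filter that holds back packets of one particular size.

    `packets` is a list of (arrival_time, size) tuples. Packets whose size
    equals `match_size` are released `delay` seconds late; all others pass
    through at once. Returns (release_time, size) tuples in delivery order.
    """
    released = []
    for arrival, size in packets:
        hold = delay if size == match_size else 0.0
        released.append((arrival + hold, size))
    # sorted() is stable, so packets released at the same instant keep
    # their arrival order.
    return sorted(released, key=lambda item: item[0])


# Illustrative arrivals; only the 90-byte size comes from the slide.
incoming = [(0.0, 321), (1.0, 90), (2.0, 539)]
order = deliver(incoming, match_size=90, delay=5.0)
```

Matching on size is what lets the perturbation target one message type rather than slowing the whole link.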

Page 19

Delaying secondary-commit

Resolving first dependency
– Delay secondary commit, and primary commit also gets delayed

Primary commit depends on the receipt of secondary commit

[Figure: AN, Primary SN, and Secondary SN; "+ delay" is applied to Secondary commit, and Primary commit follows it.]

Page 20

Delaying disk I/O traffic

Delay disk writes at primary storage node

From observation and delay: primary commit depends on secondary commit message and sync. disk write

[Figure: On the Primary SN, a Pseudo-Dev layer sits between ReiserFS (under CentraStar) and the IDE Driver. A disk request that is a WRITE goes to the delay queue (yes); otherwise it passes through (no). Timeline: Secondary-commit, Disk Write + Delay, Primary-commit.]
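
The pseudo-device's behavior (stall writes, pass reads) can be illustrated with an in-memory stand-in. This is a sketch of the idea only: the real probe is a kernel driver, while the class name `DelayingDevice`, its methods, and the injectable `sleep` parameter are hypothetical.

```python
import time


class DelayingDevice:
    """In-memory block store that stalls WRITE requests and passes READs."""

    def __init__(self, size, write_delay, sleep=time.sleep):
        self._store = bytearray(size)
        self._write_delay = write_delay
        # `sleep` is injectable so the delay can be observed without waiting.
        self._sleep = sleep

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`, with no added delay."""
        return bytes(self._store[offset:offset + length])

    def write(self, offset, data):
        """Hold the request for `write_delay` seconds, then store the data."""
        self._sleep(self._write_delay)
        self._store[offset:offset + len(data)] = data
        return len(data)


# Record the requested delays instead of actually sleeping.
delays = []
device = DelayingDevice(size=16, write_delay=0.5, sleep=delays.append)
stored = device.write(0, b"abc")
echoed = device.read(0, 3)
```

If the primary commit slips by about the write delay, the commit must wait for the synchronous disk write, which is the conclusion the slide draws.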

Page 21

Ability to analyze internal designs

Intra-box techniques: observation and perturbation by delay
– Able to deduce Object Write protocol
– Give ability to analyze internal design decisions

Serial vs. Parallel
– Primary SN handles the generation of 2nd copy (Serial) vs. AN handles both 1st and 2nd (Parallel)
– EMC Centera: write throughput is more important
– Decrease load on access nodes, increase write throughput

New TCP connections (internally) per object write
– vs. using persistent connections to remove TCP setup cost
– Prefer simplicity: no need to manage persistent conn. for all requests

[Figure: Two layouts of Client, AN, SN1, and SN2 with the copies numbered 1 and 2, contrasting the serial and parallel schemes.]
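
The load argument behind the serial choice is simple arithmetic: count the payload bytes each node must transmit per two-copy write. The sketch below assumes a hypothetical function `bytes_sent` and ignores headers, acknowledgements, and TCP setup.

```python
def bytes_sent(object_size, scheme):
    """Payload bytes each node transmits for one two-copy object write."""
    if scheme == "serial":
        # AN -> primary SN, then primary SN -> secondary SN.
        return {"AN": object_size, "primary SN": object_size}
    if scheme == "parallel":
        # AN -> primary SN and AN -> secondary SN.
        return {"AN": 2 * object_size, "primary SN": 0}
    raise ValueError("scheme must be 'serial' or 'parallel'")


serial = bytes_sent(100, "serial")
parallel = bytes_sent(100, "parallel")
```

Under this model the access node sends half as much data per write in the serial scheme, which matches the slide's point about decreasing load on access nodes.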

Page 22

Outline

Introduction

EMC Centera Overview

Deducing Protocol

Inferring Policies
– Various system perturbations

Conclusion

Page 23

Inferring internal policies

Write policies
– Level of replication, load balancing, caching/buffering

Read policies
– Caching, prefetching, load balancing

Try to infer
– Is a particular policy implemented?
– At which level is it implemented?
  • Ex: Read caching at Client, Access Node, Storage Node?
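
The slides pose the "at which level" question without spelling out a procedure. One plausible way to answer it with the probe points is to repeat a read and note which probes still see traffic. The sketch below encodes that reasoning; the function `caching_level` and its three boolean inputs are assumptions, not the authors' method.

```python
def caching_level(saw_client_to_an, saw_an_to_sn, saw_sn_disk_read):
    """Guess where a repeated read was served from.

    Each argument says whether the matching probe observed traffic during
    the second read of the same object.
    """
    if not saw_client_to_an:
        return "client"        # request never left the client
    if not saw_an_to_sn:
        return "access node"   # access node answered on its own
    if not saw_sn_disk_read:
        return "storage node"  # storage node answered without disk I/O
    return "none"              # the read went all the way to disk


second_read = caching_level(
    saw_client_to_an=True, saw_an_to_sn=True, saw_sn_disk_read=False
)
```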

Page 24

System Perturbation

Perturb the system
– Delay and extra load

4 common load-balancing factors:
– CPU load
  • High priority while loop
– Disk load
  • Background file copy
– Active TCP connections
– Network delay

[Figure: Client write() reaches the Access Node, which must choose ("?") among SN 1, SN 2, SN 3, SN …; perturbations shown on the storage nodes: CPU, Active TCP, + net delay.]

Page 25

Write Load Balancing

What factors determine which storage nodes are selected?

Experiment:
– Observe which primary storage nodes are selected
– Without load: writes are balanced
– With load: writes skew toward unloaded nodes

[Figure: AN choosing ("?") between sn#1 and sn#2, first with both unloaded, then with sn#2 loaded.]
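
The experiment reduces to tallying which node was chosen as primary for each write and comparing the shares with and without load. A minimal sketch, assuming hypothetical helpers `selection_shares` and `avoids`, an arbitrary skew threshold, and made-up selection lists:

```python
from collections import Counter


def selection_shares(primaries, nodes):
    """Fraction of writes for which each node was picked as primary."""
    counts = Counter(primaries)
    total = len(primaries)
    return {node: counts[node] / total for node in nodes}


def avoids(primaries, nodes, loaded, margin=0.5):
    """True if the loaded node's share is below `margin` times its fair share."""
    fair_share = 1.0 / len(nodes)
    return selection_shares(primaries, nodes)[loaded] < margin * fair_share


nodes = ["sn#1", "sn#2"]
# Illustrative selections only; not results from the talk.
no_load = ["sn#1", "sn#2"] * 10
with_load_on_sn1 = ["sn#2"] * 18 + ["sn#1"] * 2

balanced = selection_shares(no_load, nodes)
skewed = avoids(with_load_on_sn1, nodes, loaded="sn#1")
```

Running the same tally once per perturbation type shows which factors the selection policy actually reacts to.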

Page 26

Write Load Balancing Results

[Figure: Panels comparing sn#1 and sn#2 under five conditions: Normal (no perturbation), additional CPU load (sn#1 +CPU), disk load (sn#1 +Disk), network load (sn#1 +TCP), and incoming net. delay (sn#1 +Delay). The plotted values are not recoverable from this transcript.]

Page 27

Summary of findings

Write Policies
– Replication: Two copies in two nodes attached to different power (reliability)
– Load balancing: CPU usage (locally observable status); network status is not incorporated
– Write buffering: Storage nodes write synchronously

Read Policies
– Caching: Storage node only (commodity filesystem); access node and client do not cache
– Prefetching: Storage node only (commodity filesystem); access node and client do not prefetch
– Load balancing: Not implemented in earlier version; still reads from busy nodes

EMC Centera: Simplicity and Reliability

Page 28

Conclusion

Intra-box:
– Observe and perturb
– Deconstruct protocol and infer policies
– No access to source code

Power of probe points
– More observation places
– Ability to control the system

Systems built with more externally visible probe points
– Systems more readily understood, analyzed, and debugged
– Higher-performing, more robust and reliable computer systems

Page 29

Questions?

