Distributed Systems (Brown University), cs.brown.edu/courses/cs138/s19/lectures/Day19_2019.pdf


Distributed Systems Day 19: Practical Consensus

Paxos (1998), ZAB (2008), Raft (2014); consensus runs inside production systems such as Azure Storage.

Consensus in Practice

Consistent Data: a consensus group with Server A (leader), Server B (follower), and Server C (follower).

Tapestry Network: each Tapestry node keeps a route table, back pointers, and a local <K,V> store.

Thus far, we've used consistency for the application, or for all of the application's data.

Consistent Applications: only the metadata (the Tapestry configuration) should be consistent; the data and the storage don't need to be consistent.


Group Membership: who is in my Tapestry cluster?

Configuration Metadata: what is the IP of the LiteMiner master? How many nodes are in the Raft cluster?

Distributed Locks: who has locks on a file?

How would you implement a lock with Raft?
• Primitives exposed to the FEs: Lock(), Unlock()

(Diagram: two front ends (FEs) issue Lock()/Unlock() requests to Server A (leader); Server B and Server C are followers.)

• Challenges:
  • The locks must be implemented as a state machine (see the sketch below)
  • You must understand the log-replication semantics

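A minimal sketch of the "lock as a state machine" idea in Go. The Command layout, the Apply hook, and the notion that the leader pushes Lock()/Unlock() commands through the Raft log are assumptions for illustration; the slides only name the primitives and the challenges.

```go
package lockservice

import "sync"

// Command is one entry in the replicated log. Every replica applies the
// same sequence of committed commands, so all copies of the lock table agree.
type Command struct {
	Op     string // "Lock" or "Unlock"
	Name   string // lock name
	Client string // the front end (FE) issuing the request
}

// LockStateMachine is the deterministic state machine each replica runs
// on top of its copy of the Raft log.
type LockStateMachine struct {
	mu     sync.Mutex
	holder map[string]string // lock name -> current holder ("" means free)
}

func NewLockStateMachine() *LockStateMachine {
	return &LockStateMachine{holder: make(map[string]string)}
}

// Apply is invoked once per committed log entry, in log order. It must be
// deterministic: the outcome may depend only on the commands applied so far.
func (sm *LockStateMachine) Apply(c Command) bool {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	switch c.Op {
	case "Lock":
		if sm.holder[c.Name] == "" {
			sm.holder[c.Name] = c.Client
			return true // lock granted
		}
		return false // already held by someone else
	case "Unlock":
		if sm.holder[c.Name] == c.Client {
			sm.holder[c.Name] = ""
			return true
		}
		return false // only the current holder may unlock
	}
	return false
}
```

At the leader, Lock() and Unlock() then reduce to: append a Command to the Raft log, wait for it to commit, and return the result of Apply to the FE. The log-replication subtlety the slide points at is that an entry can commit even though the leader crashes before replying, so FEs need retries plus duplicate detection rather than assuming a lost reply means a lost lock.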

Life before Chubby was far worse…

• Distributed systems developers had to:
  • Implement Raft (well, actually Paxos)
  • Write the application as a state machine
• Potential performance problems:
  • A quorum of 5 nodes is easier than a quorum of 10K nodes
• Shared critical regions (exclusive locks) are hard to code and to understand
  • People think they can get them right… but they can't!

Before Chubby came about…

• Lots of distributed systems with clients in the 10,000s
• How to do primary election?
  – Ad hoc (no harm from duplicated work)
  – Operator intervention (correctness essential)
• Unprincipled, disorganized, costly, low availability

Which would you program with: locks or Raft?

Design requirements:
• Expose locks to developers
  • Locks are easier than redesigning applications as state machines
• Locks can't be permanent
  • If locks were permanent and servers failed, then the locks would be lost
• Not every server in the network should have to be part of the service
  • Tapestry = 10,000 nodes; Chubby = 5 nodes

(Diagram: a large Tapestry network, each node with a route table, back pointers, and a local <K,V> store, alongside a 5-node Chubby cell holding the Tapestry configuration.)

What is the Chubby paper about?

"Building Chubby was an engineering effort… it was not research. We claim no new algorithms or techniques. The purpose of this paper is to describe what we did and why, rather than to advocate it."

• The design of a consensus service based on well-known ideas:
  • distributed consensus, caching, notifications, a file-system interface

Consensus protocols: Paxos (1998), ZAB (2008), Raft (2014)
Consensus services: Chubby (2006), ZooKeeper (2010), etcd (2013)

Chubby Design

Design Decisions: Motivating Locks

• Lock service vs. a consensus (Raft/Paxos) library
• Advantages of a lock service:
  • No need to rewrite application code
  • Maintains program structure and communication patterns
  • Can support a notification mechanism
  • A smaller number of nodes (servers) is needed to make progress
• Advisory instead of mandatory locks (why?):
  • Holding a lock named F is neither necessary to access the file F, nor does it prevent other clients from doing so

Design Decisions: Lock Types

• Coarse- vs. fine-grained locks
  • Fine-grained: grab a lock before every event
  • Coarse-grained: grab a lock for a large group of events

Coarse-grained locks: less load on the lock server; less delay when the lock server fails; less lock-server capacity and availability required.
Fine-grained locks: more load on the lock server; if needed, they can be implemented on the client side (on top of coarse-grained locks).

System Structure

• A Chubby cell: a small number of replicas (e.g., 5)
• The master is selected using a consensus protocol (e.g., Raft)
• Clients
  • Send reads and writes only to the master
  • Communicate with the master via a Chubby client library
• Every replica server
  • Is listed in DNS
  • Directs clients to the master
  • Maintains a copy of a simple database
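A rough sketch of how a client library might locate the master, assuming a hypothetical whoIsMaster RPC on each replica; the slides only say that replicas are listed in DNS and direct clients to the master.

```go
package chubbyclient

import (
	"fmt"
	"net"
)

// whoIsMaster stands in for the replica RPC that reports the current
// master's address; the RPC and its name are assumptions of this sketch.
func whoIsMaster(replicaAddr string) (string, error) {
	// ... issue the RPC to the replica at replicaAddr ...
	return "", fmt.Errorf("not implemented")
}

// findMaster resolves the cell's DNS name to its replicas and asks any one
// of them where the master is; every replica knows and redirects the client.
func findMaster(cellDNSName string) (string, error) {
	replicas, err := net.LookupHost(cellDNSName) // replicas are listed in DNS
	if err != nil {
		return "", err
	}
	for _, addr := range replicas {
		if master, err := whoIsMaster(addr); err == nil {
			return master, nil
		}
	}
	return "", fmt.Errorf("no replica of %s could name a master", cellDNSName)
}
```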

Reads and Writes

• Write
  • The master propagates the write to the replicas
  • It replies after the write reaches a majority (a quorum)
• Read
  • The master replies directly, since it has the most up-to-date state
  • Reads must still go to the master
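A minimal sketch of that read/write split, with sendWrite standing in for the replication RPC; the names, and the lack of retries or pipelining, are simplifications of this sketch.

```go
package chubbycell

import "errors"

// Master is the one replica that serves all reads and coordinates writes.
type Master struct {
	db       map[string][]byte // the simple replicated database
	replicas []string          // the other members of the cell
}

// sendWrite stands in for the replication RPC to one replica; assumed here.
func sendWrite(replica, key string, value []byte) error { return nil }

// Write is acknowledged only once a majority of the cell (counting the
// master itself) has the update, so it survives any minority of failures.
func (m *Master) Write(key string, value []byte) error {
	acks := 1 // the master's own copy
	for _, r := range m.replicas {
		if sendWrite(r, key, value) == nil {
			acks++
		}
	}
	if acks <= (len(m.replicas)+1)/2 {
		return errors.New("write did not reach a majority")
	}
	m.db[key] = value
	return nil
}

// Read is answered locally: the master always has the most up-to-date
// state, which is exactly why reads must still go to the master.
func (m *Master) Read(key string) []byte {
	return m.db[key]
}
```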

Chubby API and Locks

Simple UNIX-like file-system interface
• Bare-bones file & directory structure
• Example name: /ls/foo/wombat/pouch
  • /ls: the lock service, common to all names
  • foo: the cell name
  • wombat/pouch: the name within the cell


• Does not support, maintain, or reveal:
  • moving files
  • path-dependent permission semantics
  • directory modified times or files' last-access times
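The name structure above is easy to check in code. A small sketch of how a client library might split such a name; the function is illustrative, not part of the Chubby API.

```go
package chubbyname

import (
	"fmt"
	"strings"
)

// ParseName splits a Chubby-style name into the cell and the node name
// within that cell, e.g. /ls/foo/wombat/pouch -> ("foo", "wombat/pouch").
func ParseName(name string) (cell, rest string, err error) {
	parts := strings.SplitN(name, "/", 4)
	// For "/ls/foo/wombat/pouch", parts is ["", "ls", "foo", "wombat/pouch"].
	if len(parts) < 4 || parts[0] != "" || parts[1] != "ls" {
		return "", "", fmt.Errorf("not a Chubby name: %q", name)
	}
	return parts[2], parts[3], nil
}
```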

Nodes

• Node: a file or directory
• Any node can act as an advisory reader/writer lock
• A node may be either permanent or ephemeral
  • Ephemeral nodes are used as temporary files, e.g., to indicate that a client is alive
• Metadata
  • Three names of ACLs (read / write / change-ACL-name)
  • Authentication built into the RPC system
  • 64-bit file-content checksum
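A sketch of that per-node metadata as a plain Go struct; the field names and types are assumptions, not Chubby's actual representation.

```go
package chubbynode

// NodeMeta sketches the per-node metadata listed above.
type NodeMeta struct {
	Ephemeral bool   // ephemeral nodes vanish when no client holds them open
	ReadACL   string // name of the ACL governing reads
	WriteACL  string // name of the ACL governing writes
	AdminACL  string // name of the ACL governing changes to the ACL names
	Checksum  uint64 // 64-bit checksum of the file contents
}
```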

Locks

• Any node can act as a lock (shared or exclusive)
• Advisory (vs. mandatory)
  • Locks protect resources at remote services
  • No value in the extra guards provided by mandatory locks
• Write permission is needed to acquire a lock
  • Prevents an unprivileged reader from blocking progress

Locks and Sequencers

• Potential lock problems in distributed systems
  • A holds a lock L, issues a request W, then fails
  • B acquires L (because A failed) and performs its actions
  • W arrives (out of order) after B's actions

• Solution 1: lock-delay (backward compatible)
  • The lock server prevents other clients from acquiring the lock when it becomes inaccessible or its holder has failed
  • The lock-delay period can be specified by clients
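A minimal sketch of the lock-delay rule on the lock-server side, assuming the server records when a lock was lost abnormally; all names here are illustrative.

```go
package lockserver

import "time"

// lockState tracks what the lock server needs for the lock-delay rule.
type lockState struct {
	holder    string
	lostAt    time.Time     // when the previous holder failed or became unreachable
	lockDelay time.Duration // delay the (former) holder asked for
}

// canGrant refuses to re-grant a lock that was lost abnormally until the
// lock-delay has elapsed, giving stragglers like W time to arrive and be
// handled before anyone else acts under the lock.
func (s *lockState) canGrant(now time.Time) bool {
	if s.holder != "" {
		return false // still held normally
	}
	return now.Sub(s.lostAt) >= s.lockDelay
}
```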


• Solution 2: sequencers
  • A lock holder can obtain a sequencer from Chubby
  • It attaches the sequencer to any requests that it sends to other servers
  • The other servers can verify the sequencer information
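A sketch of the sequencer idea: the holder receives a (lock name, mode, generation) token and attaches it to its requests, and downstream servers reject requests whose generation is stale. The struct and the validity check are illustrative.

```go
package sequencer

// Sequencer describes the state of a lock at the moment it was acquired.
// The field names are assumptions; the slides only say it carries enough
// information for other servers to verify it.
type Sequencer struct {
	LockName   string
	Mode       string // "shared" or "exclusive"
	Generation uint64 // bumped every time the lock changes hands
}

// Valid is the check a downstream server runs before acting on a request:
// the attached sequencer must match the newest generation the server knows
// for that lock (learned from the lock service or a cache of it).
func Valid(attached Sequencer, latestGeneration uint64) bool {
	return attached.Generation == latestGeneration
}
```

In the problem scenario above, A's delayed request W still carries the generation under which A held L; by the time W arrives, B holds L and the generation has advanced, so the servers reject W.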