Distributed Systems Day 19: Practical Consensus
Azure Storage
1998 Paxos
2014 Raft
2008 ZAB
Consensus in Practice
• Thus far, we've used consistency for the whole application, or for all of the application's data
• Consistent data: a Raft cluster (Server A as leader, Servers B and C as followers)
• Consistent applications: a Tapestry network, where each node holds only a route table, back pointers, and a local <K,V> store
• But the metadata should be consistent, while the data and the storage don't need to be
• Example: the Tapestry configuration
• Group membership: Who is in my Tapestry cluster?
• Configuration metadata: What is the IP of the LiteMiner master? How many nodes are in the Raft cluster?
• Distributed locks: Who has locks on a file?
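These uses share one pattern: a small amount of critical state kept consistent by a consensus group. As a rough illustration (not from the slides), that state could live in a tiny replicated key-value map; the key names and values below are made up.

```go
// Illustrative only: the three use cases expressed as keys in a small
// replicated store (e.g., one maintained with Raft).
package main

import "fmt"

func main() {
	store := map[string]string{
		// Group membership: who is in my Tapestry cluster?
		"/tapestry/members/node-17": "alive",
		"/tapestry/members/node-42": "alive",
		// Configuration metadata: what is the IP of the LiteMiner master?
		"/liteminer/master": "10.0.0.5:7070",
		// Distributed locks: who holds the lock on a file?
		"/locks/data/file-3": "client-9",
	}
	fmt.Println(store["/liteminer/master"])
}
```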
How would you implement a lock with Raft?
• Primitives exposed to the FEs: Lock(), Unlock()
• Challenges:
  • Locks must be implemented as a state machine
  • Must understand log-replication semantics
[Figure: front ends (FEs) issuing Lock()/Unlock() requests to a Raft cluster: Server A (leader), Server B (follower), Server C (follower)]
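A minimal sketch of the first challenge, assuming the lock service itself is the state machine and that each committed Raft log entry carries a Lock()/Unlock() command from an FE. The command format, field names, and Apply() hook are illustrative, not a specific Raft library's API.

```go
// Sketch of a lock state machine driven by committed Raft log entries.
// Because every replica applies the same entries in the same order,
// all replicas agree on who holds each lock.
package main

import "fmt"

type Cmd struct {
	Op     string // "Lock" or "Unlock" (illustrative command format)
	File   string // which lock
	Client string // requesting front end (FE)
}

type LockSM struct {
	holders map[string]string // file -> current holder ("" = free)
}

// Apply is invoked, in log order, for every committed entry.
func (s *LockSM) Apply(c Cmd) bool {
	switch c.Op {
	case "Lock":
		if s.holders[c.File] == "" {
			s.holders[c.File] = c.Client
			return true // lock granted
		}
		return false // already held by someone else
	case "Unlock":
		if s.holders[c.File] == c.Client {
			delete(s.holders, c.File)
			return true
		}
	}
	return false
}

func main() {
	sm := &LockSM{holders: map[string]string{}}
	fmt.Println(sm.Apply(Cmd{"Lock", "/ls/foo/wombat/pouch", "FE-1"}))   // true
	fmt.Println(sm.Apply(Cmd{"Lock", "/ls/foo/wombat/pouch", "FE-2"}))   // false
	fmt.Println(sm.Apply(Cmd{"Unlock", "/ls/foo/wombat/pouch", "FE-1"})) // true
}
```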
Life before Chubby was far worse…
• Distributed systems developers had to implement Raft (well, actually Paxos) themselves
• The application must be written as a state machine
• Potential performance problems: a quorum of 5 is easier than a quorum of 10K nodes
• Shared critical regions (exclusive locks) are hard to code and understand
• People think they can… but they can't!
Before Chubby came about…
• Lots of distributed systems with clients in the 10,000s
• How to do primary election?
  – Ad hoc (no harm from duplicated work)
  – Operator intervention (correctness essential)
• Unprincipled, disorganized, costly, low availability
Which would you program with: locks or Raft?
Design requirements:
• Expose locks to developers: locks are easier than redesigning applications as state machines
• Locks can't be permanent: if locks are permanent and servers fail, then locks are lost
• Not all servers in the network should be part of the service
  • Tapestry = 10,000 nodes; Chubby = 5 nodes
[Figure: Tapestry nodes (each with a route table, back pointers, and a local <K,V> store) alongside a 5-node Chubby cell holding the Tapestry configuration]
What is the Chubby paper about?
"Building Chubby was an engineering effort… it was not research. We claim no new algorithms or techniques. The purpose of this paper is to describe what we did and why, rather than to advocate it."
• Design of a consensus service based on well-known ideas:
  • distributed consensus, caching, notifications, file-system interface
Consensus protocols: Paxos (1998), ZAB (2008), Raft (2014)
Consensus services: Chubby (2006), ZooKeeper (2010), etcd (2013)
Chubby Design
Design Decisions: Motivating Locks?
• Lock service vs. a consensus (Raft/Paxos) library
• Advantages of a lock service:
  • No need to rewrite code
  • Maintains program structure and communication patterns
  • Can support a notification mechanism
  • Smaller # of nodes (servers) needed to make progress
• Advisory instead of mandatory locks (why?):
  • Holding a lock called F is neither necessary to access the file F, nor does it prevent other clients from doing so
Design Decisions: Lock Types
• Coarse- vs. fine-grained locks
  • Fine-grained: grab a lock before every event
  • Coarse-grained: grab a lock for a large group of events
• Advantages of coarse-grained locks:
  • Less load on the lock server
  • Less delay when the lock server fails
  • Fewer lock servers and less availability required
• Fine-grained locks, by contrast:
  • Put more load on the lock server
  • If needed, can be implemented on the client side (on top of coarse-grained locks)
System Structure
• Chubby cell: a small number of replicas (e.g., 5)
• Master is selected using a consensus protocol (e.g., Raft)
• Clients
  • Send reads/writes only to the master
  • Communicate with the master via a Chubby library
• Every replica server
  • Is listed in DNS
  • Directs clients to the master
  • Maintains a copy of a simple database
Reads and Writes
• Write
  • The master propagates the write to the replicas
  • Replies after the write reaches a majority (a quorum)
• Read
  • The master replies directly, as it has the most up-to-date state
  • Reads must still go to the master
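A toy sketch of these read/write paths. The types and the direct map writes stand in for the real consensus protocol; the point is the majority check on writes and the master-only reads. All names are illustrative.

```go
// Sketch of a master acknowledging a write only after a majority of the
// cell (master + replicas) has it, and serving reads locally.
package main

import "fmt"

type Replica struct {
	up bool
	db map[string]string
}

type Master struct {
	db       map[string]string
	replicas []*Replica
}

// Write succeeds once a majority of the cell holds the update.
func (m *Master) Write(k, v string) bool {
	m.db[k] = v
	acks := 1 // the master itself
	for _, r := range m.replicas {
		if r.up { // in reality the update travels through the consensus protocol
			r.db[k] = v
			acks++
		}
	}
	return acks > (1+len(m.replicas))/2
}

// Read is served directly by the master, which has the most
// up-to-date state; reads never go to the other replicas.
func (m *Master) Read(k string) string { return m.db[k] }

func main() {
	m := &Master{
		db: map[string]string{},
		replicas: []*Replica{
			{up: true, db: map[string]string{}},
			{up: false, db: map[string]string{}}, // one follower is down
		},
	}
	ok := m.Write("/ls/foo/liteminer-master", "10.0.0.5:7070")
	fmt.Println(ok, m.Read("/ls/foo/liteminer-master")) // true: 2 of 3 acknowledged
}
```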
Chubby API and Locks
Simple UNIX-like File System Interface
• Bare-bones file & directory structure
• Example: /ls/foo/wombat/pouch
  • /ls: the lock service, common to all names
  • foo: the cell name
  • wombat/pouch: the name within the cell
• Does not support, maintain, or reveal:
  • Moving files
  • Path-dependent permission semantics
  • Directory modified times, file last-access times
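A small sketch of how a name like /ls/foo/wombat/pouch splits into those three parts; the parsing helper is illustrative, not part of Chubby's interface.

```go
// Split a Chubby-style name into the cell name and the name within the
// cell, following the slide's example /ls/foo/wombat/pouch.
package main

import (
	"fmt"
	"strings"
)

// parse returns the cell name and the name within the cell.
// "/ls" is the lock-service prefix shared by every name.
func parse(path string) (cell, rest string, ok bool) {
	parts := strings.SplitN(path, "/", 4) // "", "ls", cell, rest
	if len(parts) < 4 || parts[1] != "ls" {
		return "", "", false
	}
	return parts[2], parts[3], true
}

func main() {
	cell, rest, ok := parse("/ls/foo/wombat/pouch")
	fmt.Println(cell, rest, ok) // foo wombat/pouch true
}
```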
Nodes
• Node: a file or directory
• Any node can act as an advisory reader/writer lock
• A node may be either permanent or ephemeral
  • Ephemeral nodes are used as temporary files, e.g., to indicate that a client is alive
• Metadata
  • Three names of ACLs (read / write / change ACL name)
  • Authentication built into RPC
  • 64-bit file-content checksum
Locks
• Any node can act as a lock (shared or exclusive)
• Advisory (vs. mandatory)
  • Protects resources at remote services
  • No value in the extra guards from mandatory locks
• Write permission is needed to acquire a lock
  • Prevents an unprivileged reader from blocking progress
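A sketch of an advisory reader/writer lock on a node with the write-permission rule; the type and method names are made up for illustration.

```go
// Advisory reader/writer lock on a node: acquiring even a shared lock
// requires write permission, so an unprivileged reader cannot block
// writers. Being advisory, the lock does not itself guard the node's data.
package main

import "fmt"

type Node struct {
	writers map[string]bool // clients with write permission
	holder  string          // exclusive holder ("" = free)
	readers int             // number of shared holders
}

func (n *Node) Acquire(client string, exclusive bool) bool {
	if !n.writers[client] {
		return false // write permission is needed to acquire
	}
	if exclusive {
		if n.holder == "" && n.readers == 0 {
			n.holder = client
			return true
		}
		return false
	}
	if n.holder == "" {
		n.readers++
		return true
	}
	return false
}

func main() {
	n := &Node{writers: map[string]bool{"FE-1": true, "FE-2": true}}
	fmt.Println(n.Acquire("FE-1", false))  // true: shared lock granted
	fmt.Println(n.Acquire("FE-2", true))   // false: a reader holds the lock
	fmt.Println(n.Acquire("guest", false)) // false: no write permission
}
```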
Locks and Sequencers
• Potential lock problems in distributed systems
  • A holds a lock L, issues request W, then fails
  • B acquires L (because A failed) and performs its actions
  • W arrives (out of order) after B's actions
• Solution 1: backward-compatible
  • The lock server prevents other clients from getting the lock if the lock becomes inaccessible or the holder has failed
  • The lock-delay period can be specified by clients
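A sketch of the lock-delay idea: once the holder fails or becomes unreachable, the lock stays blocked for a client-specified period so the holder's delayed request W has time to drain. Names and durations are illustrative.

```go
// Lock-delay sketch: after the holder fails, new acquisitions are refused
// until the client-chosen delay has elapsed.
package main

import (
	"fmt"
	"time"
)

type DelayedLock struct {
	holder       string
	lockDelay    time.Duration // chosen by the client that acquired the lock
	blockedUntil time.Time     // set when the holder fails or becomes unreachable
}

// HolderFailed marks the lock as unavailable for the lock-delay period.
func (l *DelayedLock) HolderFailed(now time.Time) {
	l.holder = ""
	l.blockedUntil = now.Add(l.lockDelay)
}

// TryAcquire refuses new holders until the delay has elapsed.
func (l *DelayedLock) TryAcquire(client string, now time.Time) bool {
	if l.holder != "" || now.Before(l.blockedUntil) {
		return false
	}
	l.holder = client
	return true
}

func main() {
	l := &DelayedLock{holder: "A", lockDelay: time.Minute}
	t0 := time.Now()
	l.HolderFailed(t0)
	fmt.Println(l.TryAcquire("B", t0.Add(10*time.Second))) // false: within lock-delay
	fmt.Println(l.TryAcquire("B", t0.Add(2*time.Minute)))  // true: delay elapsed
}
```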
Locks and Sequencers
• Solution 2: sequencers
  • A lock holder can obtain a sequencer from Chubby
  • It attaches the sequencer to any requests that it sends to other servers
  • The other servers can verify the sequencer information
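A sketch of sequencer checking on the receiving server, assuming the sequencer carries the lock's name and a generation number that increases each time the lock changes hands; the exact contents of a real sequencer differ, so treat the fields as illustrative.

```go
// Sequencer sketch: a server rejects a request whose sequencer was issued
// under an older generation of the lock. This catches A's delayed request
// W arriving after B has re-acquired the lock.
package main

import "fmt"

type Sequencer struct {
	LockName   string
	Generation uint64 // bumped every time the lock is acquired
}

// valid reports whether the sequencer matches the server's view of the
// lock's current generation.
func valid(s Sequencer, current map[string]uint64) bool {
	return s.Generation == current[s.LockName]
}

func main() {
	current := map[string]uint64{"/ls/foo/lockL": 7} // lock now held by B (generation 7)
	fromA := Sequencer{"/ls/foo/lockL", 6}           // W was issued while A held generation 6
	fromB := Sequencer{"/ls/foo/lockL", 7}
	fmt.Println(valid(fromA, current)) // false: stale, reject W
	fmt.Println(valid(fromB, current)) // true
}
```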