
Building Distributed Systems with Riak Core

Steve Vinoski, Architect, Basho Technologies

Cambridge, MA, USA
http://www.basho.com/

@stevevinoski
vinoski@ieee.org

http://steve.vinoski.net/

Monday, December 5, 11

What We’ll Cover

•Origins of Riak Core

•Abstractions and Functionality

•Getting started with Riak Core


20 Years Ago: Client-Server

[Diagram: many clients connecting to a single server]

Early-ish Web Apps

[Diagram: many web servers in front of a smaller tier of app servers, all sharing a single database]

Scaling Up

•Scaling up meant getting bigger boxes

•Worked for client/server and early web apps

•But couldn’t keep up with web growth


Scaling Up

•Software architectures crafted for client/server...

•And later stretched for early web...

•Simply didn’t cut it for scale-out web apps

•Resulted in fragile systems with serious operational problems


Scaling Out

•As businesses went from “having” websites to “being” websites:

• increasing number of commodity boxes

•eventually across multiple data centers


Scaling Out Changed Everything

•Dealing with more concurrency

•And more distribution

•And more operational issues

•As well as more system failures

•While also needing higher reliability and uptime


CAP Theorem

•A conjecture put forth in 2000 by Dr. Eric Brewer

• Formally proven in 2002

• In any distributed system, pick two:

•Consistency

•Availability

• Partition tolerance


Partition Tolerance

•Guarantees continued system operation even when the network breaks and messages are lost

•Systems generally tend to support P

•Leaves choice of either C or A


Consistency

•Distributed nodes see the same updates at the same logical time

•Hard to guarantee across a distributed system


Availability

•Guarantees the system will service every read and write sent to it

•Even when things are breaking


Choosing AP

•Provides read/write availability even when network breaks or nodes die

•Provides eventual consistency

•Example: Domain Name System (DNS) is an AP system

•Requires a careful look at tradeoffs


Example AP Systems

•Amazon Dynamo

•Cassandra

•CouchDB

•Voldemort

•Basho Riak


Handling Tradeoffs for AP Systems


Assumptions

•We want to scale out

•We’re choosing to lean toward AP

•We have a networked cluster of nodes, each with local storage


• Problem: how to make the system available even if nodes die or the network breaks?

• Solution:

• allow reading and writing from multiple nodes in the system

• avoid master nodes, instead make all nodes peers


• Problem: if multiple nodes are involved, how do you reliably know where to read or write?

• Solution:

• assign virtual nodes (vnodes) to physical nodes

• use consistent hashing to find vnodes for reads/writes


Consistent Hashing

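The scheme the preceding slides describe can be sketched in a few lines. A minimal illustration (in Python rather than Erlang, with invented names; not riak_core's actual code) of hashing a bucket/key pair onto a fixed partition ring and picking the N responsible vnodes:

```python
import hashlib

# A tiny consistent-hashing ring: the 160-bit SHA-1 space is split into
# equal-size partitions, each claimed by a vnode (as riak_core does).
NUM_PARTITIONS = 64
PARTITION_SIZE = 2 ** 160 // NUM_PARTITIONS

def key_to_partition(bucket, key):
    """Hash a bucket/key pair onto the ring and find its partition."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest, "big") // PARTITION_SIZE

def preference_list(bucket, key, n=3):
    """The N vnodes responsible for a key: the partition the key hashes
    into, plus the next N-1 partitions clockwise around the ring."""
    first = key_to_partition(bucket, key)
    return [(first + i) % NUM_PARTITIONS for i in range(n)]

print(preference_list("artists", "REM"))
```

Because placement depends only on where a key hashes on the ring, adding a physical node moves only the partitions that node claims; that is the minimal-rebalancing property the next slide lists.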

Consistent Hashing and Multi-Vnode Benefits

•Data is stored in multiple locations

•Loss of a node means only a single replica is lost

•No master to lose

•Adding nodes is trivial, data gets rebalanced minimally and automatically


• Problem: what about availability? What if the node you write to dies or becomes inaccessible?

• Solution: sloppy quorums

• write to multiple vnodes

• attempt reads from multiple vnodes


N/R/W Values

•N = number of replicas to store (on distinct nodes)

•R = number of replica responses needed for a successful read (specified per-request)

•W = number of replica responses needed for a successful write (specified per-request)

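A hedged sketch of how a coordinator might apply these per-request values (hypothetical helper names; Riak's real read and write coordinators are Erlang state machines):

```python
def write_succeeds(acks, w):
    """A write sent to N vnodes succeeds once W of them acknowledge it."""
    return sum(1 for a in acks if a == "ok") >= w

def read_succeeds(replies, r):
    """A read sent to N vnodes succeeds once R of them return a value."""
    return sum(1 for v in replies if v is not None) >= r

# N=3, W=2: one dead or partitioned vnode does not block the write...
assert write_succeeds(["ok", "ok", "timeout"], w=2)
# ...while requiring W=3 would trade that availability for consistency.
assert not write_succeeds(["ok", "ok", "timeout"], w=3)
```

Choosing R + W > N makes the read and write quorums overlap, so a successful read is guaranteed to see at least one replica from the latest successful write.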


• Problem: what happens if a key hashes to vnodes that aren’t available?

• Solution:

• read from or write to the next available vnode (hence “sloppy” not “strict” quorums)

• eventually repair via hinted handoff



Hinted Handoff

•Surrogate vnode holds data for unavailable actual vnode

•Surrogate vnode keeps checking for availability of actual vnode

•Once the actual vnode is again available, surrogate hands off data to it

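The three steps above can be sketched as a toy model (invented names, not riak_core's implementation):

```python
class Vnode:
    """A toy vnode: owns some data, and may hold 'hinted' data on
    behalf of an unavailable primary vnode."""
    def __init__(self, name):
        self.name = name
        self.data = {}     # key -> value this vnode owns
        self.hinted = {}   # primary name -> {key: value} held for it

    def put(self, key, value, hint=None):
        """Store a value; a hint marks it as belonging to another vnode."""
        if hint is None:
            self.data[key] = value
        else:
            self.hinted.setdefault(hint, {})[key] = value

    def handoff(self, primary):
        """The primary is reachable again: hand its data back, drop the hint."""
        for key, value in self.hinted.pop(primary.name, {}).items():
            primary.put(key, value)

# A write meant for vnode "a" lands on surrogate "b" while "a" is down;
# once "a" recovers, "b" hands the data off and forgets it.
a, b = Vnode("a"), Vnode("b")
b.put("song", "Stand", hint="a")
b.handoff(a)
assert a.data == {"song": "Stand"} and b.hinted == {}
```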

Quorum Benefits

•Allows applications to tune consistency, availability, reliability per read or write


• Problem: how do the nodes in the ring keep track of ring state?

• Solution: gossip protocol


Gossip Protocol

•Nodes “gossip” their view of the state of the ring to other nodes

• If a node changes its claim on the ring, it lets others know

•The overall state of the ring is thus kept consistent among all nodes in the ring

• Problem: what happens if vnode replicas get out of sync?

• Solution:

• vector clocks

• read repair



Vector Clocks

•Reasoning about time and causality in distributed systems is hard

• Integer timestamps don’t necessarily capture causality

•Vector clocks provide a happens-before relationship between two events


Vector Clocks

•Simple data structure: [{ActorID,Counter}]

•All data has an associated vector clock, actors update their entry when making changes

•ClockA happened-before ClockB if all actor-counters in A are less than or equal to those in B

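These rules fit in a few lines. A sketch (Python rather than Erlang, and not Riak's implementation) that also prefigures the dinner example that follows:

```python
def increment(clock, actor):
    """An actor bumps its own counter when it changes the data."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated

def merge(a, b):
    """The clock after seeing both histories: pairwise max of counters."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def happened_before(a, b):
    """ClockA happened-before ClockB if every counter in A is <= B's."""
    return all(b.get(actor, 0) >= n for actor, n in a.items())

def concurrent(a, b):
    """Neither clock descends from the other: a genuine conflict."""
    return not happened_before(a, b) and not happened_before(b, a)

wednesday = increment({}, "Alice")                       # {Alice: 1}
tuesday = increment(increment(wednesday, "Ben"), "Dave")
cathy_thu = increment(wednesday, "Cathy")
# Dave has already seen the Tuesday clock, so his Thursday descends both:
thursday = increment(merge(cathy_thu, tuesday), "Dave")

assert concurrent(tuesday, cathy_thu)       # conflict before Dave merges
assert happened_before(tuesday, thursday)   # resolved by Dave's final clock
```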

Dinner Example

•Alice, Ben, Cathy, Dave exchange some email to decide when to meet for dinner

•Alice emails everyone to suggest Wednesday


Dinner Example

•Ben and Dave email each other and decide Tuesday

•Cathy and Dave email each other; Cathy prefers Thursday, and Dave changes his mind and agrees



Dinner Example

•Alice then pings everyone to check that Wednesday is still OK

•Ben says he and Dave prefer Tuesday

•Cathy says she and Dave prefer Thursday

•Dave doesn’t answer. Conflict!

Dinner Example

•Alice suggests Wednesday: [{Alice,1}] Wednesday, which Ben, Cathy, and Dave all receive

•Ben and Dave settle on Tuesday: first [{Alice,1},{Ben,1}] Tuesday, then [{Alice,1},{Ben,1},{Dave,1}] Tuesday

•Cathy proposes Thursday to Dave: [{Alice,1},{Cathy,1}] Thursday

•Dave already holds the Tuesday clock, so when he agrees his new clock descends both: [{Alice,1},{Ben,1},{Cathy,1},{Dave,2}] Thursday

•Alice, still holding [{Alice,1}] Wednesday, then receives Ben’s Tuesday clock and Cathy’s Thursday clock

•The Thursday clock happened-after the Tuesday clock, so Thursday wins: [{Alice,1},{Ben,1},{Cathy,1},{Dave,2}] Thursday

See: Easy!

Vector Clocks are Hard

•Our example shows how vclocks can quickly grow

•Tradeoffs to keep them bounded:

•mark each entry with a timestamp

•occasionally drop old entries

•also trim vclock if too many entries


• Problem: what happens if vnode replicas get out of sync?

• Solution:

• vector clocks

• read repair


Read Repair

• If a read detects that a vnode has stale data, it is repaired via asynchronous update

•Helps implement eventual consistency

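One way to picture it (a sketch with invented names; Riak's actual repair happens asynchronously after the read has been answered):

```python
def happened_before(a, b):
    """Vector-clock test: A precedes B if every counter in A is <= B's."""
    return all(b.get(actor, 0) >= n for actor, n in a.items())

def read_repair(replicas):
    """After a read, push the newest value to any replica whose vector
    clock shows its copy is stale."""
    # The newest replica is one whose clock descends all the others.
    newest = next(r for r in replicas
                  if all(happened_before(o["clock"], r["clock"])
                         for o in replicas))
    for r in replicas:
        if r is not newest:
            r["value"] = newest["value"]
            r["clock"] = dict(newest["clock"])
    return newest["value"]

stale = {"value": "v1", "clock": {"node1": 1}}
fresh = {"value": "v2", "clock": {"node1": 2}}
assert read_repair([stale, fresh]) == "v2"
assert stale["value"] == "v2"   # the stale vnode has been repaired
```

Repairs like this are how replicas converge over time, which is the eventual-consistency behavior the slide mentions.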

This is Riak Core

•consistent hashing

•vector clocks

•sloppy quorums

•gossip protocols

•virtual nodes (vnodes)

•hinted handoff


Riak Core Implementation

•Open source

•https://github.com/basho/riak_core

• Implemented in Erlang

•Helps you build AP systems


Why Erlang?

•Erlang started in the mid-80s at Ericsson Computer Science Laboratories

•Needed a better way to program telephone switches for concurrency, fault tolerance, and hot upgrade

•Erlang released as open source in 1998 (www.erlang.org)


Concurrency with Erlang

•A single Erlang VM instance can support millions of processes

•The VM schedules these onto CPU cores

• Processes communicate via message passing

•No locks, condition variables, etc., which makes programming easier


Reliability with Erlang

• Apps typically consist of numerous Erlang processes (very lightweight threads)

• Some processes supervise others

• If a process dies, its supervisor can restart it

• “Let It Crash” philosophy

• Hot code loading for upgrades and fixes


Distribution with Erlang

•Messaging primitives the same whether in same VM or different VM, even across a network

•No “extra” packages or libraries needed for distribution, it’s just built in


Erlang Applications

• Erlang systems are composed of applications

• Erlang provides tools for creating and bundling applications, managing app dependencies

• Numerous apps can run within a single VM

• See rebar, an Erlang project build tool from Basho: https://github.com/basho/rebar


Getting Started with Riak Core


Riak Core Applications

•The Riak key-value database is an application built on riak_core

•Riak Search, a full-text search capability supplied with Riak, is also an app built on riak_core


Riak KV

[Diagram: Riak KV layered on Riak Core, exposing HTTP and Protobuf interfaces, with pluggable storage backends]

Speaking of rebar...

$ git clone git://github.com/basho/rebar.git
$ cd rebar
$ ./bootstrap
$ cp rebar ~/bin


Riak Core Templates

•https://github.com/rzezeski/rebar_riak_core

•A set of rebar templates that create a skeleton riak_core project for you


Installing rebar_riak_core

$ git clone git://github.com/rzezeski/rebar_riak_core.git
$ cd rebar_riak_core
$ mkdir -p ~/.rebar/templates
$ cp riak* ~/.rebar/templates


Using rebar_riak_core

$ mkdir myapp

$ cd myapp

$ rebar create template=riak_core appid=myapp

•Creates a bunch of files, including rebar.config to control rebar builds

•Automatically makes riak_core a project dependency


riak_id

• Implements something like Twitter’s Snowflake, a highly-available unique ID generator for tweet identifiers

•Must be 64 bits and roughly sortable

•https://github.com/seancribbs/riak_id


Operations and Vnodes

•For a riak_core app, you need:

•to know what operations your app performs

•to implement your vnodes to carry out those operations


riak_id Operations

•next_id: generate a new unique ID

•ping: vnode communication check (comes from rebar_riak_core templates, we’ll ignore it)


riak_id_vnode handle_command

•The riak_id_vnode:handle_command function implements the next_id operation

•File apps/riak_id/src/riak_id_vnode.erl


riak_pipe

•A new framework for Riak 1.0 that handles MapReduce and other tasks

•Allows tasks to be configured and spread out across the cluster

•Outputs from one task become inputs for the next

•https://github.com/basho/riak_pipe


Riak Core

•consistent hashing

•vector clocks

•sloppy quorums

•gossip protocols

•virtual nodes (vnodes)

•hinted handoff

https://github.com/basho/riak_core


Thanks
