
Full Stack High Availability In an Eventually Consistent World

Ben Coverston @bcoverston DataStax Inc

QConRio 2014

Who Am I?

• Ben Coverston

• Contributor: Apache Cassandra

• DSE Architect, DataStax Inc

Availability

“An analytics system based on Cassandra, for example, where there are no single points of failure might be considered fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.”

Failure Before ~2010

• The website can fail

• We have a farm!

What About the Middleware?

• Make it stateless

• Spin up a bunch of them

• Who cares if one fails? We can recover.

How about the Database?

• Build a massive database server

• Scale up

How about the Database?

• We can back up to tape!

• MTTR: hours, possibly days.

• We can mirror!

• Possible loss of data

• Some loss of availability during recovery

• What if we have multiple Availability Zones?

• Geographical distribution of master-slave systems is not practical

No Good Option

• RDBMS Recovery is a Special Case

• RDBMS Was Not Built for Failure

• Once you shard it, you lose the benefits of an RDBMS anyway

The Problem

• Traditional databases don’t scale

• Because information is context sensitive

• When relationships matter, so does time

• When you guarantee consistency, something else has to give.

Eventual Consistency

“In an ideal world there would only be one consistency model . . . Many systems during this time took the approach that it was better to fail the complete system than to break this transparency”[1]

— Werner Vogels, CTO, amazon.com

The CAP Theorem

• Dr. Eric Brewer

• Consistency

• Availability

• Partition Tolerance

Tradeoffs!

• With Distribution we have to accept Partitioning

• If you want strong consistency, any failure of a master will result in a partial outage.

Banks Use BASE, not ACID


• Basically Available, Soft State, Eventually Consistent (BASE)

• Real-time transactions are preferred, but ATMs fall back to partitioned mode when disconnected.

Why Does This Work?

• Templars and Hospitallers were some of the first modern bankers [4]

• ATM Networks try to be fully consistent

• Banks lose money when ATMs are not working

• Partitioned State Fallback

• Operations are commutative

• Risk is an actuarial problem
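The "operations are commutative" point is what makes partitioned-mode banking workable. A minimal sketch (the names and amounts are illustrative, not from any real ATM protocol): each disconnected ATM queues signed deltas locally, and because addition is commutative, replaying the queues from two partitioned ATMs in any order converges to the same balance.

```python
# Hypothetical sketch: ATMs in partitioned mode record signed deltas
# locally; replay order does not matter because addition is commutative.

def apply_deltas(balance, deltas):
    """Fold a sequence of signed deltas into a balance."""
    for d in deltas:
        balance += d
    return balance

atm_a = [-60, -20]    # withdrawals recorded while disconnected
atm_b = [+100, -40]   # a deposit and a withdrawal at another ATM

start = 500
# Replaying a's deltas then b's gives the same balance as b's then a's.
assert apply_deltas(start, atm_a + atm_b) == apply_deltas(start, atm_b + atm_a) == 480
```

The residual risk (a customer overdrawing while partitioned) is exactly the "actuarial problem" the slide mentions: the bank prices it in rather than refusing service.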

Building On Eventual Consistency

• Eventual Consistency means . . .

• Two queries at the same time could get different results

• If that’s bad for your application:

• Change your application logic

• Change your business model

• OR

• Don’t use eventual consistency

Seat Inventory

• Airlines are eventually consistent too!

• Aircraft are routinely oversold (because booking flights is a distributed systems problem)

• People fail to show up, the airline makes money

• Too many people show up, the airline compensates a few, the airline makes money

But What If?

• I Need Global Distribution

• Strong Consistency

• At Scale

Remember, Tradeoffs

Other Problems

• Real Time Analytics is a Challenge

• MapReduce Helps

• But it’s too slow for many things

Lambda Architecture

• Real Time, should be Real Time

• Analytics is Batch

• Real time layer depends on processing incoming streams, or pre-aggregated data

• In a non-trivial system, CAP still affects the design

Spark

• Not limited to MapReduce

• Directed Acyclic Graph (DAG) execution

• Easy-To-Understand Paradigm

• Compared to Hadoop

What is Spark?

• Apache Project Since 2010

• Fast

• 10-100x faster than Hadoop MapReduce

• Easy

• Scala, Java, Python APIs

• A lot less code (for you to write)

• Interactive Shell

Hadoop MapReduce WordCount

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Spark MapReduce for WordCount

scala> sc.cassandraTable("newyork", "presidentlocations")
         .map( _.get[String]("location") )
         .flatMap( _.split(" ") )
         .map( (_, 1) )
         .reduceByKey( _ + _ )
         .toArray
res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))

The slide builds up the pipeline step by step: cassandraTable / get[String] reads a row ("1, white house") and extracts the location text ("white house"); _.split() breaks it into words; (_, 1) maps each word to a pair ("white, 1", "house, 1"); and _ + _ in reduceByKey sums the pairs per word ("house, 1" + "house, 1" becomes "house, 2").
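The same stages can be sketched in plain Python (no Spark, sample data is made up) to make map, flatMap, and reduceByKey concrete:

```python
from collections import defaultdict

# Plain-Python sketch of the WordCount pipeline; the input rows are
# illustrative, not the talk's actual Cassandra data.

locations = ["white house", "white house", "air force one"]

# flatMap: split each location string into individual words
words = [w for loc in locations for w in loc.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word; addition is commutative and
# associative, which is what lets Spark combine partial results
# from different partitions in any order.
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

assert counts["white"] == 2
assert counts["house"] == 2
assert counts["one"] == 1
```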

Just In Memory?

• Map Reduce Style Analytics

• Not Just In-Memory (though it is really good for ‘iteration’)

Cassandra + Spark

• Cassandra-Spark Driver

• Open source

• https://github.com/datastax/cassandra-driver-spark

Distributed Aggregation (A case study)

• Real time distributed counting is hard.

• At high volume, there are still open problems.

Goals

• Provide near-real time counts for increments

• Updates are non-monotonic (can come out of order)

• We want to be able to query on windows (how many events happened last week? The week before?)

• We also want to know what is happening right now
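These goals can be sketched with a toy windowed counter (the names, window width, and data are illustrative, not the production design): increments are bucketed by the event's own timestamp, so non-monotonic (out-of-order) arrivals still land in the right window, and a window query is just a sum over bucket keys.

```python
from collections import defaultdict

WINDOW = 60  # bucket width in seconds (illustrative)

buckets = defaultdict(int)

def record(event_ts, delta):
    # Key by the event's own timestamp, not arrival time, so
    # out-of-order updates are still counted in the right window.
    buckets[event_ts // WINDOW] += delta

def window_count(start_ts, end_ts):
    return sum(v for k, v in buckets.items()
               if start_ts // WINDOW <= k < end_ts // WINDOW)

record(30, 5)    # arrives first
record(130, 2)
record(10, 1)    # late arrival for the first window
assert window_count(0, 60) == 6    # late update included
assert window_count(0, 180) == 8
```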

What about Streaming?

• Storm or Spark Streaming can help for some use cases

• But to do it right, you have to get acks from an external system (Spark), or block until the items get processed by something you might have integrated (Storm).

• Blocking for stream processing could cause back pressure, and loss of availability.

Distributed Counting

• Aggregate over time

• Per shard

• Save deltas

• Timestamps

• Because timestamps can come out of order

• Arrival time is important

What About Commit Log Replay?

• We could mark commit log entries dirty if the stream hasn’t processed the data

• Block on discarding the segment until processing is complete.

• This will work

• But we would have to fork C* to do it.

Rule #1

• We do not fork Cassandra

What About Commit Log Replay (cont.)

• Alternatively, we would have to wait for Future Cassandra

• We could use Kafka (distributed commit log)

• But then we have two systems, and two commit logs

• ZooKeeper? . . . Just no.

Compromises

• Create a C* plugin to do aggregation

• Do it locally, on each node (on the coordinator).

• Create a separate API to query for aggregation

• Create real-time aggregates on the fly

• Store snapshotted data in C* for windowed aggregation (1s, 1h, 1d, 1w, 1m).

Deltas

• Deltas are stored

• Aggregates have to be composed of commutative operations (because we cannot recalculate everything, every time)

• Cumulative Average is a good example of a compatible streaming operation.
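A cumulative average works as a streaming aggregate because it can be carried as a (sum, count) pair: pairs merge by component-wise addition, which is commutative and associative, so partial aggregates from different shards can be combined in any order. A minimal sketch (names illustrative):

```python
# Represent a running average as a (sum, count) pair; the average itself
# is not commutative, but the pair representation is.

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

def average(agg):
    s, n = agg
    return s / n

shard1 = (90.0, 3)   # sum of observed values and their count on shard 1
shard2 = (30.0, 2)

# Merge order does not matter.
assert merge(shard1, shard2) == merge(shard2, shard1)
assert average(merge(shard1, shard2)) == 24.0
```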

SHARD  ARRIVED  TIMESTAMP  DELTA  COUNT
  1       1         0        50     40
  1       2         0        40     30
  2       1         1        10     20
  2       2         2        30     40


SHARD  ARRIVED  TIMESTAMP  DELTA  COUNT
  1       1         0        90     70
  2       1         1        10     20
  2       2         1        30     40
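The compaction step between the two tables can be sketched as follows (an assumption about the mechanics, inferred from the tables: rows are merged by (shard, timestamp), DELTA and COUNT are summed, and the earliest ARRIVED value is kept; rows with a unique (shard, timestamp) pass through unchanged):

```python
# Merge per-shard delta rows by (shard, timestamp), summing DELTA and
# COUNT and keeping the earliest ARRIVED. Column order matches the
# tables above: (SHARD, ARRIVED, TIMESTAMP, DELTA, COUNT).

def compact(rows):
    merged = {}
    for shard, arrived, ts, delta, count in rows:
        key = (shard, ts)
        if key in merged:
            a, d, c = merged[key]
            merged[key] = (min(a, arrived), d + delta, c + count)
        else:
            merged[key] = (arrived, delta, count)
    return sorted((s, a, t, d, c)
                  for (s, t), (a, d, c) in merged.items())

rows = [(1, 1, 0, 50, 40),
        (1, 2, 0, 40, 30),
        (2, 1, 1, 10, 20),
        (2, 2, 2, 30, 40)]

# Shard 1's two rows at timestamp 0 collapse into one: delta 90, count 70.
assert compact(rows)[0] == (1, 1, 0, 90, 70)
```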


Aggregation Service: Average T(1:2) -> 0.923

But Nodes Can Fail

• Snapshotted data is stored with RF > 1 (similar to RDDs)

• Aggregation is done by a ‘fat client’ running on each node.

• If a network partition happens, the real-time counts may be inconsistent.

• In case of a node failure, the counts may need to be repaired

Network Partition

[Diagram: a four-node C* cluster split by a network partition; one side reports Average T(1:2) -> 0.923, the other Average T(1:2) -> 2.345]

The Partition Decision [3]

• Cancel the operation, and decrease availability

• Proceed with the operation, and risk inconsistency

If You Accept Eventual Consistency

• Real time aggregates may be inaccurate

• Due to a network partition (which may persist for hours)

• Due to latency (speed of light, network latency, small number of ms)

• In this system, historical aggregates are more reliable, because the deltas are written (and replicated) every second.

Designing for Eventual Consistency

• Partitions happen in the real world (not a myth)

• If you are building a distributed system, you have to account for possible failure.

• Define system behavior under failure conditions

• Make adjustments, set expectations

Call To Action

• When building distributed systems:

• Reason about concurrency

• Avoid Locking (if at all possible)

• Learn about Commutative Replicated Data Types (CRDTs)

• Learn about Multiversion Concurrency Control (MVCC)

• Learn Functional Programming

• Scala, Clojure (lisp), whatever

• Functional programming makes distributed programming better
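To make the CRDT bullet concrete, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs (the node names are illustrative): each node increments only its own slot, and merge takes the per-node maximum, so replicas converge no matter the order, or how many times, merges happen.

```python
# G-Counter sketch: state is a dict of per-node counts.

def increment(state, node, n=1):
    new = dict(state)
    new[node] = new.get(node, 0) + n
    return new

def merge(a, b):
    # Per-node maximum: commutative, associative, and idempotent.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

a = increment(increment({}, "node-a"), "node-a")   # node A counted 2
b = increment({}, "node-b")                        # node B counted 1

assert merge(a, b) == merge(b, a)          # commutative
assert merge(merge(a, b), b) == merge(a, b)  # idempotent
assert value(merge(a, b)) == 3
```

This is exactly the "avoid locking, reason about concurrency" advice in miniature: no coordination is needed at write time, only a convergent merge at read or anti-entropy time.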

Things to Look At

• Cassandra (fully distributed database)

• Actor Pattern

• akka-cluster (fully distributed compute platform)

References

[1] http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

[2] http://en.wikipedia.org/wiki/CAP_theorem

[3] http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

[4] http://en.wikipedia.org/wiki/History_of_banking

Thank You!

ben@datastax.com @bcoverston

We Are Hiring!