Designing a reactive data platform: Challenges, patterns, and anti-patterns

Post on 14-Jan-2017

DESIGNING A REACTIVE DATA PLATFORM: CHALLENGES, PATTERNS AND ANTI-PATTERNS

Alex Silva

Distributed

Elastic

Location Agnostic

Open

Message Driven

Self-Healing

REACTIVE

The Reactive Manifesto

Responsive, Elastic, Resilient, Message Driven

Responsiveness

Elasticity

Scaling OUT vs. Scaling UP

Elasticity

Asynchronous, Share Nothing

Divide and Conquer

Location Transparency

Synchronous Messaging

Inherent ordering introduces implicit back pressure on the sender

[Diagram: numbered message exchanges (1, 2, 3) contrasting synchronous and asynchronous messaging; a fourth step is marked "Invalid!"]

Asynchronous Messaging

“The ability of something to return to its original shape, after it has been pulled, stretched, pressed, or bent.”

Merriam-Webster

Resiliency

What about software systems?

WHAT IF I TOLD YOU

IT IS COMPLEX BUT NOT THAT COMPLICATED

Software Systems are Complex Systems

“Complex systems run in degraded mode.”

“Complex systems run as broken systems.”

Richard Cook

Asynchronous Communication

+

Eventual Consistency

Resilient Protocols

Failures

Contained

Observed

Managed

Reified as messages

Message Driven

Messages vs Events

Messages: addressable, specific ("SAVE THIS!")

Events: facts about the past, published to a topic ("SOMEBODY LOGGED IN!")
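The distinction between messages and events can be sketched in a few lines of plain Scala. This is only an illustration, not code from Hydra: the `Bus` class, the `SaveDocument` and `UserLoggedIn` types, and the topic and recipient names are all hypothetical. A message names one recipient; an event names no one and reaches whoever subscribed to its topic.

```scala
object MessagingSketch {
  import scala.collection.mutable

  // A message is addressed to one specific recipient ("SAVE THIS!").
  final case class SaveDocument(recipient: String, body: String)

  // An event is a fact about the past; it names no recipient ("SOMEBODY LOGGED IN!").
  final case class UserLoggedIn(user: String)

  // Hypothetical in-memory bus, used only to contrast the two delivery styles.
  final class Bus {
    private val inboxes = mutable.Map.empty[String, mutable.ArrayBuffer[Any]]
    private val topics  = mutable.Map.empty[String, mutable.Set[String]]

    def register(name: String): Unit = {
      inboxes.getOrElseUpdate(name, mutable.ArrayBuffer.empty[Any])
      ()
    }

    def subscribe(name: String, topic: String): Unit = {
      register(name)
      topics.getOrElseUpdate(topic, mutable.Set.empty[String]) += name
      ()
    }

    // Point-to-point: only the addressed recipient receives the message.
    def send(msg: SaveDocument): Unit =
      inboxes.get(msg.recipient).foreach(inbox => inbox += msg)

    // Publish/subscribe: every subscriber of the topic receives the event.
    def publish(topic: String, event: UserLoggedIn): Unit =
      topics.getOrElse(topic, mutable.Set.empty[String]).foreach(n => inboxes(n) += event)

    def inbox(name: String): List[Any] =
      inboxes.get(name).map(_.toList).getOrElse(Nil)
  }
}
```

In this sketch the sender of `SaveDocument` has to know who handles it, while the publisher of `UserLoggedIn` does not know (or care) how many subscribers exist.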

REAL-TIME DATA INGESTION PLATFORM

Why Akka?

Reactive, Elastic, Fault Tolerant

Load Management

Both up and out

Location Transparency

Akka Actors

Lightweight, Reactive, Asynchronous, Resilient

Challenges with Akka

Learning Curve, Type Safety, Debugging, Dead Letters

Why Kafka?

Distributed Log

High Throughput, Replicated, Concurrency

Kafka

[Diagram: two producers and three clients connected to a Kafka cluster of three brokers. Broker 1 holds Topic 1 / Partition 0, Broker 2 holds Topic 1 / Partition 1, Broker 3 holds Topic 1 / Partition 3.]

Why Spark?

Fast!

Unified Platform

Functional Paradigm

Rich Library Set

Active Community

PATTERNS AND ANTI-PATTERNS

Hydra Topology (diagram):

Ingestion: Hydra Core, Ingestors (HTTP)

Spark (Batch and Streaming): Hydra Core, Dispatchers (HTTP; RDBMS, HDFS)

Conductors: Hydra Core, Conductors (HTTP)

Persistence :: Kafka: Hydra Core, Persistence (HTTP)

The modules are linked through Akka Remoting; the numbers 3, 2 and 2 appear in the diagram.

GOOD PRACTICE:

DECENTRALIZE THE PROCESSING OF KEY TASKS

HYDRA INGESTION MODULE

Actor Hierarchy

Supervision

Kafka Gateway

Message Protocol

MESSAGE HANDLERS

[Diagram: a request carrying metadata (META) and a body ({ }) arrives at /ingest and passes through the Coordinator, the Registry, and the Handlers]

Hydra Ingestion Flow

Handler Registry

Monitors registered handlers for errors/stops

Broadcasts messages

Handler Lifecycle

GOOD PRACTICE:

DESIGN AN INCREMENTAL COMMUNICATION PROTOCOL

Hydra Ingestion Protocol

[Diagram: protocol messages exchanged with the message handlers: Publish, Join, <<Silence>>, Validate, Valid, Invalid, Ingest, STOP]

[Diagram: Publish is announced ("HEY GUYS! CHECK THIS OUT!"); handlers either answer with Join ("HUH?! NICE!! BRING IT!!") or decline ("NAH…")]

Hydra Ingestion Protocol: Publish

Handler Registry

Message handlers

Hydra Ingestion Protocol: Validation

HOW DOES IT LOOK? (Validate)

BAD! (Invalid)

GOOD! (Valid)

Ingestion Coordinator

Message handlers

Hydra Ingestion Protocol: Invalid Message

Ingestion Coordinator

Error Reporter

GOT A BAD ONE

ReportError

Ingest, for each handler

Hydra Ingestion Protocol: Ingest

SHIP IT!

Ingest

Encode, Persist

abstract class BaseMessageHandler extends Actor
  with ActorConfigSupport
  with ActorLogging
  with IngestionFlow
  with ProducerSupport
  with MessageHandler {

  // Default behavior for every protocol step; concrete handlers override what they need.
  ingest {
    case Initialize => // nothing required by default

    case Publish(request) =>
      // Staying silent means this handler does not join the ingestion.
      log.info(s"Publish message was not handled by ${self}. Will not join.")

    case Validate(request) =>
      sender ! Validated

    case Ingest(request) =>
      log.warning(s"Ingest message was not handled by ${self}.")
      sender ! HandlerCompleted

    case Shutdown => // nothing required by default

    case Heartbeat =>
      Health.get(self).getChecks
  }
}
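The protocol steps above (Publish, Join or silence, Validate, Ingest) can be sketched without Akka as a plain function. This is a minimal sketch, not Hydra's implementation: `Handler`, `TopicHandler`, `Outcome` and `coordinate` are hypothetical names, and the rule that a single invalid answer blocks ingestion for every joined handler is an assumption made for the example.

```scala
object IngestionProtocolSketch {
  final case class Request(topic: String, payload: String)

  // Hypothetical handler interface. Returning false from `publish` models <<Silence>>.
  trait Handler {
    def name: String
    def publish(req: Request): Boolean               // true = Join
    def validate(req: Request): Either[String, Unit] // Left = Invalid, Right = Valid
    def ingest(req: Request): Unit
  }

  // Hypothetical handler that joins requests for one topic and checks payload size.
  final class TopicHandler(val name: String, topic: String, maxLength: Int) extends Handler {
    private var storedPayloads = List.empty[String]
    def stored: List[String] = storedPayloads.reverse
    def publish(req: Request): Boolean = req.topic == topic
    def validate(req: Request): Either[String, Unit] =
      if (req.payload.length <= maxLength) Right(()) else Left("payload too long")
    def ingest(req: Request): Unit = { storedPayloads = req.payload :: storedPayloads }
  }

  sealed trait Outcome
  case object NoHandlerJoined extends Outcome
  final case class Rejected(errors: List[String]) extends Outcome
  final case class Ingested(by: List[String]) extends Outcome

  // Coordinator logic: broadcast Publish, validate with the joined handlers, then ingest.
  def coordinate(handlers: List[Handler], req: Request): Outcome = {
    val joined = handlers.filter(_.publish(req))
    if (joined.isEmpty) NoHandlerJoined
    else {
      val errors = joined
        .map(h => h -> h.validate(req))
        .collect { case (h, Left(reason)) => s"${h.name}: $reason" }
      // Assumption: one Invalid answer stops ingestion for every joined handler.
      if (errors.nonEmpty) Rejected(errors)
      else {
        joined.foreach(_.ingest(req))
        Ingested(joined.map(_.name))
      }
    }
  }
}
```

Because each step only asks for what the next step needs, a handler that never joins costs a single broadcast and nothing more.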

GOOD PRACTICE:

HIDE AN ELASTIC POOL OF RESOURCES BEHIND ITS OWNER

Publisher, Subscriber, Back pressure

Less of this…

Router, Publisher

Workers

More of this!

akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }

      /services-manager/kafka_producer {
        router = round-robin-pool
        resizer {
          lower-bound = 5
          upper-bound = 50
          messages-per-resize = 500
        }
      }
    }
  }
}

akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }
    }

    provider = "akka.cluster.ClusterActorRefProvider"
  }

  cluster {
    # Every node of a cluster must use the same actor system name (here: Hydra).
    seed-nodes = ["akka.tcp://Hydra@127.0.0.1:2552", "akka.tcp://Hydra@172.0.0.1:2553"]
  }
}
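The idea behind the router configuration above, one owner in front of a pool whose size can change, can be sketched in plain Scala. This is not how Akka implements routers or resizers: `PoolOwner`, its `resize` method and the worker factory are hypothetical, and the bounds simply play the role of `lower-bound` and `upper-bound`.

```scala
object PoolSketch {
  // The owner is the only thing callers see; the workers behind it can come and go.
  final class PoolOwner[A, B](lowerBound: Int, upperBound: Int, newWorker: Int => A => B) {
    require(lowerBound >= 1 && lowerBound <= upperBound, "need 1 <= lowerBound <= upperBound")

    private var workers: Vector[A => B] = Vector.tabulate(lowerBound)(newWorker)
    private var next = 0

    def size: Int = workers.size

    // Round-robin dispatch: callers never pick a worker themselves.
    def dispatch(msg: A): B = {
      val worker = workers(next)
      next = (next + 1) % workers.size
      worker(msg)
    }

    // Grow or shrink the pool; the requested size is clamped to the configured bounds.
    def resize(target: Int): Unit = {
      val wanted  = math.max(lowerBound, math.min(upperBound, target))
      val current = workers.size
      workers =
        if (wanted <= current) workers.take(wanted)
        else workers ++ Vector.tabulate(wanted - current)(i => newWorker(current + i))
      next = next % workers.size
    }
  }
}
```

Since callers hold a reference to the owner only, resizing never invalidates anything they know about.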

GOOD PRACTICE:

USE SELF-DESCRIBING MESSAGES

trait KafkaMessage[K, P] {
  val timestamp = System.currentTimeMillis

  def key: K

  def payload: P

  def retryOnFailure: Boolean = true
}

case class JsonMessage(key: String, payload: JsonNode) extends KafkaMessage[String, JsonNode]

object JsonMessage {
  val mapper = new ObjectMapper()

  def apply(key: String, json: String): JsonMessage = {
    val payload: JsonNode = mapper.readTree(json)
    new JsonMessage(key, payload)
  }
}

case class AvroMessage(schema: SchemaHolder, key: String, json: String)
  extends KafkaMessage[String, GenericRecord] {

  def payload: GenericRecord = {
    val converter: JsonConverter[GenericRecord] = new JsonConverter[GenericRecord](schema.schema)
    converter.convert(json)
  }
}

GOOD PRACTICE:

PREFER BINARY DATA FORMATS FOR COMMUNICATION

Why Avro?

Binary Format, Space Efficient

Evolutionary Schemas

Automatic Tables
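The space argument for binary formats can be illustrated with the JDK alone. The sketch below does not use Avro and does not reproduce Avro's wire format: a fixed-width `DataOutputStream` encoding is swapped in to keep the example free of dependencies, and the `Reading` record and its fields are hypothetical. The fixed field order plays the role a schema would play.

```scala
object EncodingSketch {
  import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
  import java.nio.charset.StandardCharsets

  // Hypothetical record used only for the size comparison.
  final case class Reading(id: Long, timestamp: Long, ok: Boolean)

  // Text encoding: field names and digits are repeated in every single message.
  def toJsonBytes(r: Reading): Array[Byte] =
    s"""{"id":${r.id},"timestamp":${r.timestamp},"ok":${r.ok}}""".getBytes(StandardCharsets.UTF_8)

  // Binary encoding: values only, written in an order both sides agree on.
  def toBinary(r: Reading): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out   = new DataOutputStream(bytes)
    out.writeLong(r.id)        // 8 bytes
    out.writeLong(r.timestamp) // 8 bytes
    out.writeBoolean(r.ok)     // 1 byte
    out.flush()
    bytes.toByteArray
  }

  // Decoding reads the fields back in the same agreed order.
  def fromBinary(data: Array[Byte]): Reading = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val id = in.readLong()
    val ts = in.readLong()
    val ok = in.readBoolean()
    Reading(id, ts, ok)
  }
}
```

The binary form carries no field names, which is why producer and consumer need a shared schema to make sense of it.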

GOOD PRACTICE:

DELEGATE AND SUPERVISE! REPEAT!

Error Kernel

Ingestion Actors: Coordinators

Supervises ingestion at the request level

Coordinates protocol flow

Reports errors and metrics

GOOD PRACTICE:

LET IT CRASH

Let it Crash

Components where full restarts are always ok

Transient failures are hard to find

Simplified failure model

override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
    case _: ActorInitializationException   => akka.actor.SupervisorStrategy.Stop
    case _: FailedToSendMessageException   => Restart
    case _: ProducerClosedException        => Restart
    case _: NoBrokersForPartitionException => Escalate
    case _: KafkaException                 => Escalate
    case _: ConnectException               => Escalate
    case _: Exception                      => Escalate
  }

val kafkaProducerSupervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    kafkaProducerProps,
    childName = actorName[KafkaProducerActor],
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2
  ))

class KafkaProducerActor extends Actor
  with LoggingAdapter
  with ActorConfigSupport
  with NotificationSupport[KafkaMessage[Any, Any]] {

  import KafkaProducerActor._

  implicit val ec = context.dispatcher

  override def preRestart(cause: Throwable, message: Option[Any]): Unit = {
    // No ack from Kafka: send the message to itself again after an exponentially growing delay.
    message match {
      case Some(rp: RetryingProduce) =>
        notifyObservers(KafkaMessageNotDelivered(rp.msg))
        val nextBackOff = rp.backOff.nextBackOff
        val retry = RetryingProduce(rp.topic, rp.msg)
        retry.backOff = nextBackOff
        context.system.scheduler.scheduleOnce(nextBackOff.waitTime, self, retry)

      case Some(produce: Produce) =>
        notifyObservers(KafkaMessageNotDelivered(produce.msg))
        if (produce.msg.retryOnFailure) {
          context.system.scheduler.scheduleOnce(initialDelay, self, RetryingProduce(produce.topic, produce.msg))
        }

      case _ => // no in-flight produce request, so there is nothing to retry
    }
  }
}
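The retry delays used above grow exponentially between a minimum and a maximum, with some randomness added. A minimal sketch of that calculation follows; it is an illustration of exponential backoff with jitter, not a copy of Akka's internals or of Hydra's `backOff` helper. The function name `delayMillis` is hypothetical, and the random draw is passed in as `rnd` so that the result is reproducible.

```scala
object BackoffSketch {
  /** Delay before restart number `restartCount` (0-based), in milliseconds.
    *
    * @param rnd a random draw in [0, 1); pass 0.0 to switch the jitter off
    */
  def delayMillis(restartCount: Int, minMs: Long, maxMs: Long, randomFactor: Double, rnd: Double): Long = {
    // Double the minimum delay on every restart...
    val exponential = minMs.toDouble * math.pow(2, restartCount)
    // ...but never go past the configured maximum.
    val capped = math.min(maxMs.toDouble, exponential)
    // Jitter is applied after the cap, so a result may exceed maxMs by up to randomFactor.
    (capped * (1.0 + rnd * randomFactor)).toLong
  }
}
```

With the values from the supervisor definition (3 seconds minimum, 30 seconds maximum) and no jitter, the first delays are 3 s, 6 s, 12 s and 24 s, after which every delay is 30 s. The jitter keeps many failed producers from retrying at the same instant.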

Monitoring through Death Watches

WHAT ABOUT SOME ANTI-PATTERNS?

NOT SO GOOD PRACTICE:

BUILDING NANO SERVICES

Hydra Topology (diagram repeated from the earlier slide)

NOT SO GOOD PRACTICE:

TREATING LOCATION TRANSPARENCY AS A FREE-FOR-ALL

Guaranteed Delivery in Hydra

What does guaranteed delivery mean?

At most once semantics

Can be made stronger
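One common way to make at-most-once delivery stronger is to retry until an acknowledgement arrives and to let the receiver discard duplicates. The sketch below shows that idea in plain Scala. It is not Akka's or Hydra's implementation, and every name in it (`Envelope`, `LossyLink`, `IdempotentReceiver`, `sendAtLeastOnce`) is hypothetical. Losses are scripted by attempt number so the behavior is deterministic.

```scala
object DeliverySketch {
  import scala.collection.mutable

  final case class Envelope(id: Long, payload: String)

  // Receiver that tolerates redelivery by remembering the ids it has already processed.
  final class IdempotentReceiver {
    private val seen = mutable.Set.empty[Long]
    private val log  = mutable.ArrayBuffer.empty[String]

    def receive(e: Envelope): Unit = if (seen.add(e.id)) log += e.payload
    def processed: List[String] = log.toList
  }

  // Link that loses the message or the ack on the scripted attempt numbers (1-based).
  final class LossyLink(receiver: IdempotentReceiver, dropMessageOn: Set[Int], dropAckOn: Set[Int]) {
    private var attemptCount = 0
    def attempts: Int = attemptCount

    /** Returns true only when the ack made it back to the sender. */
    def send(e: Envelope): Boolean = {
      attemptCount += 1
      if (dropMessageOn(attemptCount)) false
      else {
        receiver.receive(e)
        !dropAckOn(attemptCount)
      }
    }
  }

  // At-most-once: one attempt, the result is ignored.
  def sendAtMostOnce(link: LossyLink, e: Envelope): Unit = { link.send(e); () }

  // At-least-once: retry until acknowledged or until the attempts run out.
  def sendAtLeastOnce(link: LossyLink, e: Envelope, maxAttempts: Int): Boolean =
    (1 to maxAttempts).exists(_ => link.send(e))
}
```

A lost ack makes the sender deliver the same envelope twice, which is why the receiver side has to be idempotent for the stronger guarantee to be safe.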

Akka Remoting

Peer-to-Peer, Serialization, Delivery Reliability, Latency

The Reliable Proxy Pattern

@throws(classOf[Exception])
override def init: Future[Boolean] = Future {
  val useProxy = config.getBoolean("message.proxy", false)
  val ingestorPath = config.getRequiredString("ingestor.path")

  // Go through the reliable proxy only when it is enabled in the configuration.
  ingestionActor =
    if (useProxy) context.actorOf(ReliableIngestionProxy.props(ingestorPath))
    else context.actorSelection(ingestorPath)

  val cHeaders = config.getOptionalList("headers")
  topic = config.getRequiredString("kafka.topic")
  // Each configured header has the form "name:value".
  headers = cHeaders match {
    case Some(ch) =>
      List(
        ch.unwrapped.asScala.map { header =>
          val sh = header.toString.split(":")
          RawHeader(sh(0), sh(1))
        }: _*
      )
    case None => List.empty[HttpHeader]
  }
  true
}

NOT SO GOOD PRACTICE:

NOT KEEPING MESSAGE PROTOCOLS BOUND TO THEIR CONTEXTS

object Messages {

case object ServiceStarted

case class RegisterHandler(info: ActorRef)

case class RegisteredHandler(name: String, handler: ActorRef)

case class RemoveHandler(path: ActorPath)

case object GetHandlers

case object InitiateIngestion extends HydraMessage

case class RequestCompleted(s: IngestionSummary) extends HydraMessage

case class IngestionSummary(name: String)

case class Produce(topic: String, msg: KafkaMessage[_, _], ack: Option[ActorRef]) extends HydraMessage

case object HandlerTimeout extends HydraMessage

case class Validate(req: HydraRequest) extends HydraMessage

case class Validated(req: HydraRequest) extends HydraMessage

case class NotValid(req: HydraRequest, reason: String) extends HydraMessage

case object HandlingCompleted extends HydraMessage

case class Publish(request: HydraRequest)

case class Ingest(request: HydraRequest)

case class Join(r: HydraRequest) extends HydraMessage

}

class HandlerRegistry extends Actor with LoggingAdapter with ActorConfigSupport {

  override def receive: Receive = { ... }

  override val supervisorStrategy = OneForOneStrategy() {
    case e: Exception =>
      report(e)
      Restart
  }
}

object HandlerRegistry {

case class RegisterHandler(info: HandlerInfo)

case class RegisteredHandler(name: String, handler: ActorRef)

case class RemoveHandler(path: ActorPath)

case object GetHandlers

}

NOT SO GOOD PRACTICE:

DEVELOPING OVERLY CHATTY PROTOCOLS

What’s next?

Conductors

Webhooks

What’s streaming into Hydra today?

[Chart: Average Ingestions Per Second (series: Requests); y-axis from 0 to 2,500; x-axis labels Dec-15, Jan-16, Jan-16, Jan-16, 1-Feb, 3/1/16]

9,730 lines of Scala code

Production Platform Since Jan 2016

C.I. through Jenkins and Salt

Some Facts

roarking

QUESTIONS?

Thank You!