From node.js to Scala - with a 100x performance boost

Post on 13-Feb-2017

FROM node.js TO Scala with a 100x perf. boost!

BY ITAMAR RAVID | MAY 3, 2016

AGENDA

WE’LL TALK ABOUT…

• What we do, our challenges and what led us to Scala and Akka;

• How we redesigned our core data processing service;

• Some useful lessons and patterns.

There will be relatively little node.js bashing. Promise.

BIGPANDA: THE ANSWER TO ALERT FATIGUE

RABBIT IS DOWN! NO FREE SPACE!

INBOUND QUEUE OVERFLOWING! OUTBOUND QUEUE OVERFLOWING!

APPLICATION HEALTH CRITICAL! TOO MANY FAILED HTTP REQS!

rabbit-1, ping

rabbit-2, disk

queue-1, size

queue-2, size

app1, health

app2, 500 codes

RabbitMQ cluster

ping disk

RabbitMQ node 3

queue size queue size

API server

health failed reqs

Correlation Algorithm

CorrelationStage

NormalizationStage

IN TERMS OF STREAMS…

RABBIT IS DOWN! NO FREE SPACE!

INBOUND QUEUE OVERFLOWING! OUTBOUND QUEUE OVERFLOWING!

APPLICATION HEALTH CRITICAL! TOO MANY FAILED HTTP REQS!

Nagios event source

Datadog event source

AppDynamics event source

rabbit-1, ping

rabbit-2, disk

queue-1, size

queue-2, size

app1, health

app2, 500 codes

RabbitMQ cluster

ping disk

RabbitMQ node 3

queue size queue size

API server

health failed reqs

Correlation Algorithm

CHALLENGE 1 SCALING TO MEET CUSTOMER LOAD

HIGH-LEVEL ARCHITECTURE

[Diagram: three API servers, three Normalization instances and three Correlation instances, with RabbitMQ exchanges and Mongo.]

USAGE OF RABBITMQ

Correlation

Correlation

Correlation

RabbitMQ Cons. Hash

Queue (Customers A, B, C)

Queue (Customers D, E, F)

Queue (Customers X, Y, Z)

Route by hash on customer
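The routing idea above can be sketched in plain Scala as a small hash ring. This is only an illustration of consistent hashing, not the RabbitMQ exchange's actual implementation; `CustomerRouting`, `ring`, `queueFor` and the queue names are hypothetical, and the ring is assumed to be non-empty.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

// Hypothetical names throughout; a sketch of consistent hashing,
// not the RabbitMQ plugin's implementation.
object CustomerRouting {
  // Stable hash built from the first four bytes of an MD5 digest.
  private def hash(key: String): Int = {
    val d = MessageDigest.getInstance("MD5").digest(key.getBytes(UTF_8))
    ((d(0) & 0xff) << 24) | ((d(1) & 0xff) << 16) | ((d(2) & 0xff) << 8) | (d(3) & 0xff)
  }

  // Place each queue on the ring at several points to even out the split.
  def ring(queues: Seq[String], pointsPerQueue: Int = 64): TreeMap[Int, String] =
    TreeMap((for (q <- queues; i <- 0 until pointsPerQueue) yield hash(s"$q#$i") -> q): _*)

  // First ring point at or after the customer's hash, wrapping to the start.
  // Assumes the ring is non-empty.
  def queueFor(ring: TreeMap[Int, String], customer: String): String = {
    val it = ring.iteratorFrom(hash(customer))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}
```

Because the same customer always hashes to the same point, all of that customer's events land on one queue, which is what makes serial, in-order processing possible.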

DATA FOR A GIVEN CUSTOMER MUST BE PROCESSED SERIALLY, IN ORDER. SO…

(ALERT) STORMS

MEET REALITY!

Not fun!

A hiccup in a customer’s datacenter => an entire queue is blocked

CHALLENGE 2 CORRELATION PREVIEW

CORRELATION

Same host, 4 hours …

MATCHING RULES

+

INCIDENT: rabbit-1 (ping, disk)

rabbit-1, ping, t=5

rabbit-1, disk, t=7

CORRELATION

MATCHING RULES

+

INCIDENT: rabbit-1 (ping, disk)

Same host, 30 minutes (was 4 hours)

rabbit-1, ping, t=5

rabbit-1, disk, t=7

CORRELATION

MATCHING RULES

+

INCIDENT

rabbit-1, ping, t=5

rabbit-1, disk, t=7

Same host, 30 minutes (was 4 hours)

?
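A minimal sketch of a "same host, within a time window" matching rule, to make the question above concrete. The `Alert` and `Incident` types, the `correlate` function and the abstract time units are assumptions for illustration; the real matching rules are not shown in the slides.

```scala
// Hypothetical types; t is in arbitrary time units, as on the slides.
final case class Alert(host: String, check: String, t: Long)
final case class Incident(host: String, alerts: Vector[Alert])

object Correlation {
  // An alert joins the most recent incident for its host if it arrives
  // within `window` time units of that incident's first alert;
  // otherwise it opens a new incident.
  def correlate(alerts: Seq[Alert], window: Long): Vector[Incident] =
    alerts.sortBy(_.t).foldLeft(Vector.empty[Incident]) { (incidents, alert) =>
      val idx = incidents.lastIndexWhere(_.host == alert.host)
      if (idx >= 0 && alert.t - incidents(idx).alerts.head.t <= window)
        incidents.updated(idx, incidents(idx).copy(alerts = incidents(idx).alerts :+ alert))
      else
        incidents :+ Incident(alert.host, Vector(alert))
    }
}
```

With the two alerts from the slide (t=5 and t=7), a window of 4 yields one incident and a window of 1 yields two. Previewing a rule change means rerunning the same alerts through the new rule, which is only possible if the alert history can be replayed.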

A CORRELATION TIME-MACHINE

1 2 3 4 5 6 7 8 9 10 … N

ALERTS WE’RE HERE

START FROM HERE (DC OUTAGE)

Correlation Servers

OFFSETS

THIS MEANS…

REPLAY, DETERMINISTIC, FAST

SOLUTIONS!

EXISTING CORRELATION SOLUTION

[Diagram: Rabbit queues, three Processing Stages, and Mongo.]

PROCESSING STAGE - A NODE.JS CALLBACK.

Shared mutable state

No isolation

No replay

DESIRED SOLUTION

[Diagram: Rabbit queues, three Processing Stages, and Mongo.]

NODE.JS - PLATFORM LIMITATIONS

HEAP SIZE - LIMITED TO 1.7GB

SINGLE THREADED :-(

TypeError: undefined is not a function

COMPONENTS

DURABLE EVENT STREAM

PLATFORM

COMPUTING FRAMEWORK

ACTOR-BASED SOLUTION

Node Manager

Customer A Pipeline

KafkaReader

Algorithm runner

MongoWriter

RabbitWriter

Customer B Pipeline

Customer C Pipeline

SUPERVISION

MESSAGING

customer_a_inputs

NEXT-GEN SOLUTION

Node Manager

Customer A Pipeline

KafkaReader

Algorithm runner

MongoWriter

RabbitWriter

Customer B Pipeline

Customer C Pipeline

SUPERVISION

MESSAGING

FAILURE ISOLATION

customer_a_inputs

NEXT-GEN SOLUTION

Node Manager

Customer A Pipeline

KafkaReader

Algorithm runner

MongoWriter

RabbitWriter

Customer B Pipeline

Customer C Pipeline

SUPERVISION

MESSAGING

SEPARATE DISPATCHERS FOR QOS-TUNING

customer_a_inputs

SCALING OUT

Node 1

Cluster Manager

Node Manager

Node 2

Node Manager

Node 3

Node Manager

LESSONS LEARNED

PRUNING AN INFINITE DATA STREAM

1 2 3 4 5 6 7 8 9 10 … N

PRUNING AN INFINITE DATA STREAM

1 2 3 4 5 6 7 8 9 10 … N

t=10, Critical t=8, OK

PRUNING AN INFINITE DATA STREAM

5 6 7 8 9 10 … N

t=8, OK

MISSING ALERTS :-(

PRUNING STREAMS THAT RESULT IN STATE REQUIRES STATE RECOVERY.

PRUNING AN INFINITE DATA STREAM

5 6 7 8 9 10 … N

Snapshot Repository

<data …> lastOffset: 4

<data …> lastOffset: 8

<data …> lastOffset: 10

ON BOOT, THE LATEST SNAPSHOT IS LOADED AND THE STREAM SEEKS TO THE STORED OFFSET.
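The boot sequence above can be sketched as a fold that starts from the newest snapshot instead of from an empty state. `Event`, `Snapshot`, `Recovery` and the "latest status per alert" state are hypothetical stand-ins; the slides only show a snapshot repository with a stored `lastOffset`.

```scala
// Hypothetical types: the slides only name a snapshot repository
// and a lastOffset stored with each snapshot.
final case class Event(offset: Long, alertId: String, status: String)
final case class Snapshot(state: Map[String, String], lastOffset: Long)

object Recovery {
  // Pure state transition: remember the latest status per alert.
  def applyEvent(state: Map[String, String], e: Event): Map[String, String] =
    state.updated(e.alertId, e.status)

  // Start from the newest snapshot (or an empty state) and replay
  // only the retained events that come after its offset.
  def recover(snapshots: Seq[Snapshot], retained: Seq[Event]): Snapshot = {
    val start = snapshots.sortBy(_.lastOffset).lastOption
      .getOrElse(Snapshot(Map.empty, -1L))
    retained
      .filter(_.offset > start.lastOffset)
      .sortBy(_.offset)
      .foldLeft(start) { (snap, e) => Snapshot(applyEvent(snap.state, e), e.offset) }
  }
}
```

State that was built from pruned offsets survives because it is carried in the snapshot, so only the tail of the stream has to be reprocessed.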

PRUNING AN INFINITE DATA STREAM

CHALLENGES:

• COMPACTNESS

• SCHEMA EVOLUTION

kryo/chill with a manual de/serializer <=> Map[String, Any]

Schema evolution support with some caveats

Big datasets are only a few MBs in size

KEY TAKEAWAYS

USE SNAPSHOTS TO PRUNE STREAMS

JSON IS NOT THE ONLY SOLUTION!

FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS

INPUTS

MSG BATCHES

Kafka reader

Algorithm Runner

MongoWriter

RabbitWriter

FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS

Kafka reader

Algorithm Runner

MongoWriter

RabbitWriter

INPUTS

MSG BATCHES 3 2 1

PIPELINING BETWEEN STAGES

RETRYING

Persistent failure will restart entire pipeline

FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS

INPUTS

MSG BATCHES 4 3 2 1

PIPELINING BETWEEN STAGES

RETRYING

Kafka reader

Algorithm Runner

MongoWriter

RabbitWriter

KEY TAKEAWAYS

CAPTURE COMMON ACTOR BEHAVIOR USING TRAITS (BUT MAKE SURE THEY COMPOSE!)
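A sketch of the traits idea in plain Scala, without Akka, using stackable (`abstract override`) traits. `Stage`, `Retrying`, `Counting` and `FlakyWriter` are hypothetical names, not the talk's actual traits; the point is that mix-in order changes behavior, which is why composition needs care.

```scala
import scala.util.control.NonFatal

// Plain Scala, no Akka; all names are hypothetical.
trait Stage {
  def process(batch: Seq[String]): Seq[String]
}

// Stackable trait: retries the wrapped stage a bounded number of times.
// Once attempts run out the exception propagates, e.g. to a supervisor.
trait Retrying extends Stage {
  def maxAttempts: Int = 3
  abstract override def process(batch: Seq[String]): Seq[String] = {
    def attempt(n: Int): Seq[String] =
      try super.process(batch)
      catch { case NonFatal(_) if n < maxAttempts => attempt(n + 1) }
    attempt(1)
  }
}

// Stackable trait: counts how many times the wrapped stage is invoked.
trait Counting extends Stage {
  var calls = 0
  abstract override def process(batch: Seq[String]): Seq[String] = {
    calls += 1
    super.process(batch)
  }
}

// A stage that fails a fixed number of times before succeeding.
class FlakyWriter(failures: Int) extends Stage {
  private var remaining = failures
  def process(batch: Seq[String]): Seq[String] =
    if (remaining > 0) { remaining -= 1; throw new RuntimeException("transient failure") }
    else batch.map(_.toUpperCase)
}
```

`new FlakyWriter(2) with Counting with Retrying` counts every attempt, because `Retrying` wraps `Counting`; `new FlakyWriter(2) with Retrying with Counting` counts the call once, because `Counting` sits outside the retry loop.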

DEFERRING AND CONTROLLING STATE MUTATION

PREVIOUSLY:

Processing Stage

Mongo

Processing Stage

Processing Stage

HERE BE RACE CONDITIONS!

DEFERRING AND CONTROLLING STATE MUTATION

Algorithm runner

Mongo

Mongo Writer

Instructions

AN INTERPRETER

DEFERRING STATE MUTATION

id1 id2 id1 id1 id2 id2 id1 id2 id1 id1 id2 id2

Mongo

get set

OPTIMIZE ME!

FOLDING INSTRUCTIONS TO REDUCE I/O

id1 -> inst1 :: inst2 :: inst3 … :: Nil

id2 -> inst1 :: inst2 :: inst3 … :: Nil

Mongo

getMultiple setMultiple

foldLeft(initialObject)(processInstruction)
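The folding step above can be sketched as follows. The instruction set, the `Doc` shape and the in-memory `FakeStore` (standing in for Mongo and counting round trips) are assumptions for illustration; only `getMultiple`, `setMultiple`, `processInstruction` and the `foldLeft` shape come from the slide.

```scala
// Hypothetical instruction set and document shape.
sealed trait Instruction { def id: String }
final case class SetStatus(id: String, status: String) extends Instruction
final case class AddTag(id: String, tag: String) extends Instruction

final case class Doc(status: String = "unknown", tags: Set[String] = Set.empty)

// In-memory stand-in for Mongo that counts round trips.
class FakeStore {
  private var docs = Map.empty[String, Doc]
  var roundTrips = 0
  def getMultiple(ids: Set[String]): Map[String, Doc] = {
    roundTrips += 1
    ids.map(id => id -> docs.getOrElse(id, Doc())).toMap
  }
  def setMultiple(updated: Map[String, Doc]): Unit = {
    roundTrips += 1
    docs ++= updated
  }
}

object Interpreter {
  // Pure interpretation of a single instruction.
  def processInstruction(doc: Doc, inst: Instruction): Doc = inst match {
    case SetStatus(_, s) => doc.copy(status = s)
    case AddTag(_, t)    => doc.copy(tags = doc.tags + t)
  }

  // One bulk read, an in-memory fold per id, one bulk write.
  def run(store: FakeStore, batch: Seq[Instruction]): Unit = {
    val byId    = batch.groupBy(_.id) // each id keeps its instructions in batch order
    val current = store.getMultiple(byId.keySet)
    val next = byId.map { case (id, insts) =>
      id -> insts.foldLeft(current(id))(processInstruction)
    }
    store.setMultiple(next)
  }
}
```

However many instructions a batch contains, the store sees two round trips instead of a get and a set per instruction.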

KEY TAKEAWAYS

DECOUPLE STATE MUTATION FROM PROCESSING

OPTIMIZE STATE MUTATION WHEN INTERPRETING

MEASURE!

Dropwizard Metrics + metrics-scala

KEY TAKEAWAYS

INSTRUMENT AWAY!

FINAL NUMBERS AND BENEFITS

OVERALL RATE IMPROVEMENT:

~ 16 events/s on a single node.js process at peak

1600-2500 events/s on a single pipeline at peak

ISOLATION: Actor-per-customer; failure isolation

COMPLETE DETERMINISM: Actions determined entirely by Kafka contents; amazing for debugging!

SCALABILITY: More nodes => more actors; reduced I/O

Q&A

WE’RE HIRING! iravid@bigpanda.io

GROCERY LIST

RabbitMQ - op-rabbit

MongoDB - reactivemongo

Kafka - kafka-clients

Zookeeper - curator

Dependency Injection - scaldi

Logging - log4j2, scala-logging, raven-log4j2

Metrics - Dropwizard Metrics, metrics-scala

Config - Typesafe Config

JSON - play-json

Binary serde - kryo/chill