DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

Post on 15-Apr-2017


Turning Explorers into Discoverers

UNIFYING REAL TIME AND HISTORICAL ANALYTICS WITH THE LAMBDA ARCHITECTURE

WHO YOU ARE

CTOs and VPEs
Architects and Engineers
Scientists and Analysts
Product

WHO I AM

Peter Nachbaur

peter@keen.io / @PeterNachbaur

Currently: Product and Sales at Keen IO

Past:
Analytics Platform Architect at Keen IO
Analytics Platform Engineer at WB Games
Java Developer at SmartBrief
Cognitive Science at Vassar

AGENDA

Rise of Unified Analytics
Lambda Architecture Overview
Lessons Learned at Keen

CHALLENGES IN DATA ENGINEERING

8 YEARS AGO

4 YEARS AGO

YESTERDAY

WHY NOT BOTH INDEED?

SMART DEVICES

MOBILE APPS

WEBSITES

TEAMS

CUSTOMERS

ANYWHERE

Keen IO Analytics API

insights events

THE STACK

nginx

tornado play

kafka storm

cassandra

zookeeper memcached

redis mongo

flask react

c3

WE HAD A PROBLEM

HOW DID WE KNOW?

EXPERIENCING DIFFICULTIES

Cassandra data model

Inflexible Infrastructure Provider

Polyglot codebases

Scaling! 10x, 100x

RED QUEEN HYPOTHESIS

*WE* HAVE A PROBLEM

WHAT’S THE SOLUTION?

BUT, WAT DO?

DESIRED PROPERTIES

• robustness and fault tolerance
• low latency reads (and updates)
• generalization and extensibility
• minimal maintenance and debuggability

LAMBDA ARCHITECTURE OVERVIEW

COMPLEXITY IS THE ENEMY OF PRODUCTIVITY

5 KEY CONCEPTS

1. Parallel Ingestion
2. Batch Layer
3. Serving Layer
4. Speed Layer
5. Query Unifier

HELPING CLOTHE THE WALRUS

1. PARALLEL INGESTION

2. BATCH LAYER

write once, bulk read often
MASTER dataset
creates denormalized batch views
high latency

RAWNESS

IMMUTABILITY

PERPETUITY
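The three properties above (raw, immutable, kept forever) can be sketched as an append-only store. This is only an illustration, not Keen's implementation: the `Event` fields and the `MasterDataset` class are hypothetical names chosen for the walrus shirt shop example.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    timestamp: int    # epoch seconds (hypothetical field)
    visitor_id: str   # hypothetical field
    shirt_id: str     # hypothetical field


class MasterDataset:
    """Append-only store: raw events are added, never updated in place."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        # The only write operation: add one raw event to the end.
        self._events.append(event)

    def scan(self) -> Iterator[Event]:
        # Bulk read of every event, as a batch job would do.
        return iter(self._events)

    def __len__(self) -> int:
        return len(self._events)
```

Because nothing is overwritten, any batch view can be thrown away and rebuilt from `scan()` at any time.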

IN BUSINESS FOR TWO YEARS! HOW MANY UNIQUE VISITORS PER MONTH? SHIRT?

BATCH LAYER VIEWS

1 year = ~38,000,000 ranges of hours
1 year = 8760 hour buckets
x 1,000 Shirts
x 1,000,000 Walruseses. Walri?
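The figures on this slide can be checked with a little arithmetic. The sketch below assumes a "range of hours" means any contiguous (start hour, end hour) pair within a non-leap year, and that the shirt and visitor multipliers describe the worst case where every combination is materialized; both readings are assumptions, not something the slide states.

```python
HOURS_PER_YEAR = 365 * 24  # hour buckets in a non-leap year

# A range is a (start, end) pair of hour buckets with start <= end,
# so there are n * (n + 1) / 2 of them.
hour_ranges = HOURS_PER_YEAR * (HOURS_PER_YEAR + 1) // 2

# Assumed worst case: one cell per (hour bucket, shirt, visitor).
SHIRTS = 1_000
VISITORS = 1_000_000
hourly_cells = HOURS_PER_YEAR * SHIRTS * VISITORS

print(HOURS_PER_YEAR)  # 8760
print(hour_ranges)     # 38373180
print(hourly_cells)    # 8760000000000
```

Precomputing a view for every range is clearly impractical, which is why batch views are usually stored per bucket and combined at query time.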

RECOMPUTATION VS INCREMENTAL
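A minimal sketch of the two strategies, using unique visitors per month as the view. The event shape `(month, visitor_id)` and both function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str]  # (month, visitor_id), a hypothetical shape
View = Dict[str, Set[str]]


def recompute_view(master: Iterable[Event]) -> View:
    """Recomputation: rebuild the whole view from the master dataset."""
    view: View = {}
    for month, visitor in master:
        view.setdefault(month, set()).add(visitor)
    return view


def update_view(view: View, new_events: Iterable[Event]) -> View:
    """Incremental: fold only the new events into an existing view."""
    for month, visitor in new_events:
        view.setdefault(month, set()).add(visitor)
    return view
```

Recomputation touches all the data on every run but cannot accumulate errors, since each run starts from scratch. The incremental version is cheaper per update but has to keep state (here, the full set of visitor IDs) and inherits any mistake already present in the view.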

3. SERVING LAYER

batch updates -> batch views
low latency, random reads
no random writes!
“stale” data
simplicity

SHARD DATA INTELLIGENTLY

NEW SHOP DESIGN 6 HOURS AGO… HOW MANY UNIQUE VISITORS PER MINUTE? SHIRT?

4. SPEED LAYER

low latency updates
random writes AND reads
stream processing
incremental computation of transient views

SPEED LAYER OPTIONS

asynchronous or synchronous
one-at-a-time or micro-batched

VIEW EXPIRATION
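One common way to expire a realtime view is to drop every bucket that the batch layer has already absorbed. The sketch below assumes views are keyed by an integer hour bucket and that `batch_horizon` (a hypothetical name) is the latest bucket covered by the most recent batch run.

```python
from typing import Dict


def expire_realtime_view(
    realtime_view: Dict[int, int], batch_horizon: int
) -> Dict[int, int]:
    """Keep only buckets newer than what the batch layer has absorbed."""
    return {
        bucket: count
        for bucket, count in realtime_view.items()
        if bucket > batch_horizon
    }
```

For example, `expire_realtime_view({1: 5, 2: 7, 3: 2}, batch_horizon=2)` keeps only bucket 3, because the batch view now answers for buckets 1 and 2.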

CUSTOMER FACING ANALYTICS… HOW MANY UNIQUE VISITORS PER STATE? SHIRT?

5. UNIFIED QUERIES

batch view = function(master data)

realtime view = function(realtime view, new data)

query = function(batch view, realtime view)
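The three equations above map directly onto three functions. This is a sketch under simplifying assumptions: events are reduced to the integer hour in which they occurred, the views are simple per-hour counts, and the batch and realtime views cover disjoint hours so that summing them does not double count.

```python
from collections import Counter
from typing import Iterable


def batch_view(master_data: Iterable[int]) -> Counter:
    """batch view = function(master data): full recount per hour."""
    return Counter(master_data)


def realtime_view(view: Counter, new_data: Iterable[int]) -> Counter:
    """realtime view = function(realtime view, new data)."""
    updated = Counter(view)   # copy so the previous view is untouched
    updated.update(new_data)  # add counts for the newly arrived events
    return updated


def query(batch: Counter, realtime: Counter, start: int, end: int) -> int:
    """query = function(batch view, realtime view) over hours start..end."""
    return sum(batch[h] + realtime[h] for h in range(start, end + 1))
```

Counts merge by simple addition. A measure such as unique visitors would need a mergeable representation (for example sets or a cardinality sketch) in both views instead of plain counters.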

5 KEY CONCEPTS

1. Parallel Ingestion
2. Batch Layer
3. Serving Layer
4. Speed Layer
5. Query Unifier

BONUS CONCEPT!

EVENTUAL ACCURACY

CONCEPTUAL CRITIQUES

ALTERNATIVE?

APACHE BEAM

REALITIES OF MIGRATION AND LESSONS LEARNED

HOW TO START A MIGRATION

WHAT DID WE HAVE?

1. kafka
2. storm
3. cassandra

batch-speed layer?

WHAT DID WE NEED?

10x, 100x data volumes
More flexibility
Reduced Operational Burden and TCO

ADDING BATCH LAYER

WHILE YOU’RE AT IT…

GOTCHAS

CROSS-PROVIDER NETWORKING

TOOL VERSIONING

PARALLEL INGESTION

DELETES

QUERYLIB

CULTURAL DEBT

UNIFIED ANALYTICS

LAMBDA ARCHITECTURE

PRACTICE > THEORY

CATCH { Q: QUESTIONS => …}

@PeterNachbaur pwn@keen.io

JOIN THE COMMUNITY!

Analytics Slack Group -> keen.chat

Open source -> github.com/keen

Twitter -> @keen_io

IRL, right meow! Say hi to us! Ask more questions!

keen.io