
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid

Date post: 08-Jan-2017
Category:
Upload: tony-ng
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid September 2016 Tony Ng Director of Engineering
Page 1: Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid

Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid

September 2016

Tony Ng, Director of Engineering

Page 2

EBAY BUSINESS

Page 3

Evolving from our Auction Roots..

• 80% New Items
• 86% Fixed Price Items
• 65% Ships for Free

*Q2 2016 data

Page 4

EBAY AT A GLANCE

• $8.6B Revenue in 2015
• $82B GMV in 2015
• 1B Live Listings *
• 164M Global Active Buyers *
• 190 Countries eBay apps are available in #
• 326M App downloads *

*Q2 2016 data
#Q4 2015 data

Page 5

DATA

CHECKOUT, WATCH LIST, SHARE, LOCAL, PREFERENCES, BID, CLICK, SEARCH

• 10’s TB USER BEHAVIORAL DATA / DAY
• 10’s B DATABASE QUERIES / DAY
• 100’s PB DATA

Page 6

ENTERPRISE DATA PLATFORM

Data Stores

Data Streams & Processing

Machine Learning

Enterprise Data Ecosystem

Data Ingestion

Personalization / Optimization

Insights / Reporting

Kylin

Page 7

Open-source real-time analytics and stream processing framework

Page 8

Pulsar Stream

• Focus on user behavioral data processing
• Complex event processing
  – Streaming SQL with extensible annotations
  – Java
• SQL for common stream operations (filtering, mutation, aggregation) with time windows
• Declarative topology construction
• Each stage can adopt its own release and deployment cycles
• Dynamic partitioning and flow control

Page 9

Multi Stage Distributed Pipeline

Page 10

Event Filtering and Routing Example

// create filtered stream
insert into FilteredStream select guid, evt_type, C1, C2, C3 from RawStream where evt_type = 'bid';

// publish and route filtered stream
@PublishOn(topics="Topic1")
@Output("OutboundChannel")
@ClusterAffinityTag(column = guid)
select * from FilteredStream;
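As a rough illustration of what the filtering and routing statements above express, the Python sketch below keeps only bid events and derives a target partition from the guid, so that events with the same guid always reach the same node. The function names, the CRC32 hash, and the partition count are assumptions made for illustration; they are not Pulsar's actual implementation.

```python
import zlib

FIELDS = ("guid", "evt_type", "C1", "C2", "C3")

def filter_bids(raw_stream):
    """Keep only 'bid' events and project the selected columns."""
    for event in raw_stream:
        if event.get("evt_type") == "bid":
            yield {name: event.get(name) for name in FIELDS}

def partition_for(guid, num_partitions):
    """Map a guid to a partition with a stable hash (hypothetical scheme)."""
    return zlib.crc32(guid.encode("utf-8")) % num_partitions

def route(filtered_stream, num_partitions):
    """Group filtered events by target partition."""
    routed = {p: [] for p in range(num_partitions)}
    for event in filtered_stream:
        routed[partition_for(event["guid"], num_partitions)].append(event)
    return routed

raw = [
    {"guid": "u1", "evt_type": "bid", "C1": 1, "C2": 2, "C3": 3},
    {"guid": "u2", "evt_type": "click", "C1": 4, "C2": 5, "C3": 6},
    {"guid": "u1", "evt_type": "bid", "C1": 7, "C2": 8, "C3": 9},
]
routed = route(filter_bids(raw), num_partitions=4)
```

Both bid events for u1 end up in the same partition, which is the property an affinity tag on guid is meant to give.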

Page 11

Aggregate Computation Example

// create 10-second time window context
create context MCContext start @now end pattern [timer:interval(10)];

// create aggregated stream within specified time window
context MCContext insert into AggStream select count(*) as M1, guid, evt_type from RawStream group by guid, evt_type output snapshot when terminated;

// publish aggregated stream
select * from AggStream;

Page 12

Stream Aggregation Time Window

Sliding Window

Tumbling Window
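The difference between the two window types can be sketched in a few lines of Python. This is only an illustration under assumed conventions: timestamps are integer seconds, a window covers [start, start + size), and windows that would start before time 0 are ignored. The function names are hypothetical.

```python
from collections import Counter

def tumbling_counts(events, size):
    """Count events per (window_start, guid, evt_type); windows do not overlap."""
    counts = Counter()
    for ts, guid, evt_type in events:
        window_start = (ts // size) * size
        counts[(window_start, guid, evt_type)] += 1
    return counts

def sliding_counts(events, size, slide):
    """Count events per window; an event belongs to every window that covers it."""
    counts = Counter()
    for ts, guid, evt_type in events:
        # Latest window start (a multiple of `slide`) that is <= ts.
        start = (ts // slide) * slide
        while start > ts - size and start >= 0:
            counts[(start, guid, evt_type)] += 1
            start -= slide
    return counts

events = [(1, "u1", "bid"), (4, "u1", "bid"), (12, "u1", "bid")]
tumbling = tumbling_counts(events, size=10)
sliding = sliding_counts(events, size=10, slide=5)
```

With a 10-second size, the event at t=12 falls into exactly one tumbling window (starting at 10) but into two sliding windows (starting at 5 and at 10).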

Page 13

TopN Computation Example

// create 60-second time window context
create context MCContext start @now end pattern [timer:interval(60)];

// create topN stream via sorting
context MCContext insert into TopNStream select count(*) as M1, guid, evt_type from RawStream group by guid, evt_type order by M1 desc limit 10;

// publish topN stream
select * from TopNStream;
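For comparison, the same top-N idea over the events of a single window can be written in Python as follows. This is a minimal sketch, not the engine's implementation; the `top_n` helper and the sample events are made up for illustration.

```python
from collections import Counter

def top_n(events, n):
    """Return the n most frequent (guid, evt_type) pairs with their counts."""
    counts = Counter((guid, evt_type) for guid, evt_type in events)
    return counts.most_common(n)

# Events observed inside one time window (hypothetical data).
window_events = [
    ("u1", "bid"), ("u2", "view"), ("u1", "bid"),
    ("u3", "bid"), ("u2", "view"), ("u1", "bid"),
]
top2 = top_n(window_events, 2)
```

Here `top2` holds the two groups with the highest counts, highest first, which corresponds to sorting by the count in descending order and applying a limit.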

Page 14

PULSAR BEHAVIORAL DATA PIPELINE

Page 15

Pulsar Behavioral Data Pipeline

[Diagram: Real-time Pipeline. Labels: Producing Applications, Collector, Sessionizer, BOT Detection, Event Distributor, Metrics Calculator, Metrics Store, Enriched Sessionized Events, Real Time Consumers, Real Time Dashboard and Services, Kafka, Batch Loader, Druid, Hadoop / Kylin]

Page 16

Sessionization: Group together events of a single user visit

Event stream: e1 e2 e3 e4 e5 e6 e7 e8 . . .

User 1: >30 min of inactivity

Session A (User 1): e1, e2, e4
Session B (User 2): e3, e5, e6
Session C (User 1): e7, e8
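The grouping rule can be sketched in Python as below. It is a simplified, single-process illustration: the timestamps (in seconds) are invented so that User 1 is inactive for more than 30 minutes between e4 and e7, and the `sessionize` function is hypothetical rather than part of Pulsar.

```python
SESSION_GAP_SECONDS = 30 * 60  # inactivity threshold from the slide

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (timestamp, user, event_id) tuples into per-user sessions.

    A new session starts when a user has been inactive for more than `gap` seconds.
    """
    sessions = []       # list of (user, [event ids]) in order of creation
    open_session = {}   # user -> index of the user's current session
    last_seen = {}      # user -> timestamp of the user's latest event
    for ts, user, event_id in sorted(events):
        if user not in open_session or ts - last_seen[user] > gap:
            sessions.append((user, []))
            open_session[user] = len(sessions) - 1
        sessions[open_session[user]][1].append(event_id)
        last_seen[user] = ts
    return sessions

events = [
    (0, "user1", "e1"), (60, "user1", "e2"), (90, "user2", "e3"),
    (120, "user1", "e4"), (150, "user2", "e5"), (200, "user2", "e6"),
    (120 + 31 * 60, "user1", "e7"), (120 + 32 * 60, "user1", "e8"),
]
sessions = sessionize(events)
```

With these timestamps the result matches Sessions A, B, and C on the slide.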

Page 17

Sessionization Challenges

• Session state management
  – High read/write throughput
  – State recovery when a node crashes or fails
• Session Expiration
  – Full table scan is not acceptable

Page 18

Sessionization Solution

• Long-lived state management (at least 30 minutes)
  – Local Off-Heap Cache
• Instantaneous Session Expiration (<= 1 sec delay)
  – Doubly-Linked Off-Heap Map (Local Access)
  – Order by Expiration time (O(1))
• Pluggable Sessionization logic
  – SQL with customized annotation
  – Counter
  – State

Page 19

Sessionizer Architecture

[Diagram labels: Collector, Sessionizer, IMC, OMC, Timer, Local Off-Heap Session Cache, Remote Store Client, Remote Session Store, BotDetection, Distributor; arrows: Persist, Sync, Recovery]

Page 20

Bot Detection Overview

• Detect non-human activities in near real time
• May treat bot traffic differently during analysis
• High level bot rules
  – Self-declared bots by user agent
  – Behavior within a session or time window
• Tag events with bot flag
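The two kinds of rules listed above can be sketched as a simple tagging function. Everything specific here is an assumption for illustration: the user-agent markers, the per-session event threshold, and the field names are invented and are not eBay's actual bot rules.

```python
BOT_UA_MARKERS = ("bot", "crawler", "spider")  # hypothetical markers
MAX_EVENTS_PER_SESSION = 1000                   # hypothetical threshold

def is_self_declared_bot(user_agent):
    """Rule 1: the user agent itself announces a bot."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_UA_MARKERS)

def tag_events(events, session_event_counts):
    """Return copies of the events with a boolean 'bot' flag added."""
    tagged = []
    for event in events:
        by_agent = is_self_declared_bot(event.get("user_agent"))
        # Rule 2: behavior within a session, here an unusually high event count.
        count = session_event_counts.get(event["session_id"], 0)
        by_behavior = count > MAX_EVENTS_PER_SESSION
        tagged.append({**event, "bot": by_agent or by_behavior})
    return tagged

events = [
    {"session_id": "s1", "user_agent": "Mozilla/5.0"},
    {"session_id": "s2", "user_agent": "Googlebot/2.1"},
    {"session_id": "s3", "user_agent": "Mozilla/5.0"},
]
tagged = tag_events(events, {"s1": 12, "s2": 3, "s3": 5000})
```

Tagging rather than dropping keeps bot traffic available, so later analysis can decide how to treat it.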

Page 21

Bot Detection

[Diagram labels: Collector, Sessionizer, BotDetection, Distributor, Bot Detection Service; data flows: Behavior Events and Metrics, Bot Signature, Bot tagged Stream]

Page 22

PULSAR INTEGRATION WITH OTHER SYSTEMS

Page 23

Pulsar Integration with Kafka

• Kafka
  – Persistent messaging queue
  – High availability, scalability and throughput
• Pulsar leveraging Kafka
  – Supports pull and hybrid messaging model
  – Loading of data from real-time pipeline into Hadoop and other metric stores
  – Use schema to validate event payload
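The slide does not say which schema format is used, so the following Python sketch only illustrates the general idea of checking an event payload against a schema before it is loaded downstream. The schema, its field names, and the `validate` helper are all hypothetical.

```python
# Hypothetical schema: field name -> required Python type.
EVENT_SCHEMA = {"guid": str, "evt_type": str, "timestamp": int}

def validate(event, schema=EVENT_SCHEMA):
    """Return a list of problems; an empty list means the payload is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems

ok = validate({"guid": "u1", "evt_type": "bid", "timestamp": 1472688000})
bad = validate({"guid": "u1", "timestamp": "yesterday"})
```

Returning the list of problems instead of raising makes it easy to route invalid events to a separate stream for inspection.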

Page 24

Messaging Models

• Push Model (at most once delivery semantics)
• Pull Model (at least once delivery semantics)
• Hybrid Model

[Diagram labels: Producer, Consumer, Netty, Queue (Kafka), Replayer, Pause/Resume]
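The two delivery semantics named on the slide differ in what happens when a consumer fails partway through a message. The Python sketch below is a generic simulation of a consumer reading from a log with one crash and a restart; it is not how Pulsar's channels are implemented, and the `consume` function and its parameters are made up for illustration.

```python
def consume(messages, commit_first, crash_at=None):
    """Process messages from a log, simulating one crash followed by a restart.

    commit_first=True  -> the offset is committed before processing (at most once).
    commit_first=False -> the offset is committed after processing (at least once).
    """
    processed = []
    committed = 0      # next offset to read after a restart
    crashed = False
    while committed < len(messages):
        offset = committed
        crash_now = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1               # commit, then process
            if crash_now:
                crashed = True                   # crash before processing: message lost
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])   # process, then commit
            if crash_now:
                crashed = True                   # crash before commit: message replayed
                continue
            committed = offset + 1
    return processed

log = ["m0", "m1", "m2"]
at_most_once = consume(log, commit_first=True, crash_at=1)
at_least_once = consume(log, commit_first=False, crash_at=1)
```

At most once can lose a message but never repeats one; at least once never loses a message but may deliver it twice, so downstream processing has to tolerate duplicates.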

Page 25

Pulsar Integration with Kylin

• Apache Kylin
  – Distributed analytics engine
  – SQL interface and multi-dimensional analysis (OLAP) on Hadoop
  – Interactive query on billions of rows
• Pulsar leveraging Kylin
  – Build multi-dimensional OLAP cube over long time period
  – Aggregate/drill-down on dimensions such as browser, OS, device, geo location
  – Capture metrics such as session length, page views, event counts
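The idea of an OLAP cube, precomputing a measure for every combination of dimensions so that roll-up and drill-down become lookups, can be shown on a toy scale in Python. This is a conceptual sketch only: the `build_cube` function and the sample rows are invented, and nothing here reflects Kylin's storage or build process.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dimensions, measure):
    """Precompute sum(measure) for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(totals)
    return cube

rows = [
    {"browser": "Chrome", "os": "Windows", "page_views": 5},
    {"browser": "Chrome", "os": "macOS", "page_views": 3},
    {"browser": "Safari", "os": "macOS", "page_views": 4},
]
cube = build_cube(rows, ("browser", "os"), "page_views")

total = cube[()][()]                                   # grand total
by_browser = cube[("browser",)]                        # roll-up to one dimension
chrome_on_mac = cube[("browser", "os")][("Chrome", "macOS")]  # drill-down
```

The number of cuboids grows as 2^n with the number of dimensions, which is why cube building is done as a batch job over long time periods rather than at query time.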

Page 26

Pulsar Integration with Druid

• Druid
  – Real-time ROLAP engine for aggregation, drill-down and slice-n-dice
• Pulsar leveraging Druid
  – Real-time analytics dashboard
  – Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds
  – Aggregate/drill-down on dimensions such as browser, OS, device, geo location
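A metric such as "visitors in the last 5 minutes, refreshed every 10 seconds" can be illustrated by keeping visitor sets in 10-second buckets and combining the buckets that fall inside the window. The Python sketch below only demonstrates the metric itself; the `VisitorCounter` class is hypothetical and does not use or mirror Druid's API.

```python
from collections import defaultdict

BUCKET_SECONDS = 10      # refresh interval from the slide
WINDOW_SECONDS = 5 * 60  # "last 5 minutes" from the slide

class VisitorCounter:
    def __init__(self):
        # bucket start time -> visitor ids seen in that bucket
        self._buckets = defaultdict(set)

    def record(self, ts, visitor_id):
        """Add a visit to the 10-second bucket that contains `ts`."""
        bucket_start = (ts // BUCKET_SECONDS) * BUCKET_SECONDS
        self._buckets[bucket_start].add(visitor_id)

    def visitors_in_window(self, now):
        """Distinct visitors in buckets that start within the last WINDOW_SECONDS."""
        cutoff = now - WINDOW_SECONDS
        seen = set()
        for start, visitors in self._buckets.items():
            if cutoff < start <= now:
                seen |= visitors
        return len(seen)

counter = VisitorCounter()
counter.record(0, "a")
counter.record(5, "b")
counter.record(200, "a")
counter.record(290, "c")
at_295 = counter.visitors_in_window(295)
at_300 = counter.visitors_in_window(300)
```

Sets are merged rather than summed because the same visitor can appear in several buckets and must be counted once.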

Page 27

BEHAVIORAL DATA DRIVEN APPLICATIONS

Page 28

http://www.ebay.com/trending

Page 29

TRENDING: ALGORITHMS NARROW THE FOCUS

Algorithms and machine learning identify significant trends

Humans provide the context and an interesting story

Page 30

Activity Timeline for Personalization

[Diagram: user events (search, view, watch, bid, purchase) along a time axis, with regions labeled Offline (historical), Nearline, and Online (in-session)]

Customer Profile
• Price
• Category
• Sale Type
• Item Condition
• Deals

Customer Intent
• Price
• Category
• Sale Type
• Item Condition
• Deals

Page 31

EXAMPLE PERSONALIZED CONTENT

Personalized digest for a consumer interested in jewelry and accessories

Personalized digest for a consumer interested in auto and electronics

Page 32

Behavioral Data: A/B Testing

Page 33

More Information

• GitHub: http://github.com/pulsarIO
  – repos: pipeline, framework, docker files
• Website: http://gopulsar.io
  – Technical whitepaper
  – Getting started
  – Documentation
• Google group: http://groups.google.com/d/forum/pulsar
