Date posted: 08-Jan-2017
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
September 2016
Tony Ng, Director of Engineering
EBAY BUSINESS
Evolving from our Auction Roots
• New Items: 80% *
• Fixed Price Items: 86% *
• Ships for Free: 65% *
* Q2 2016 data
EBAY AT A GLANCE
$8.6B Revenue in 2015
$82B GMV in 2015
1B Live Listings *
164M Global Active Buyers *
190 Countries eBay apps are available in #
326M App downloads *
* Q2 2016 data
# Q4 2015 data
DATA
• User activity: checkout, watch list, share, local, preferences, bid, click, search
• 10's of TB of user behavioral data / day
• 10's of billions of database queries / day
• 100's of PB of data
ENTERPRISE DATA PLATFORM
Data Stores
Data Streams & Processing
Machine Learning
Enterprise Data Ecosystem
Data Ingestion
Personalization / Optimization
Insights / Reporting
Kylin
Open-source real-time analytics and stream processing framework
Pulsar Stream
• Focus on user behavioral data processing
• Complex event processing
  – Streaming SQL with extensible annotations
  – Java
• SQL for common stream operations (filtering, mutation, aggregation) with time windows
• Declarative topology construction
• Each stage can adopt its own release and deployment cycles
• Dynamic partitioning and flow control
Multi Stage Distributed Pipeline
Event Filtering and Routing Example
// create filtered stream
insert into FilteredStream select guid, evt_type, C1, C2, C3 from RawStream where evt_type = 'bid';
// publish and route filtered stream
@PublishOn(topics="Topic1")
@Output("OutboundChannel")
@ClusterAffinityTag(column = guid)
select * from FilteredStream;
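The filtering and affinity-based routing above can be illustrated outside the streaming engine. The following is a minimal Python sketch, not Pulsar code: the function names are hypothetical, and hashing the guid modulo the consumer count is an assumed stand-in for whatever partitioning scheme the framework actually applies for the affinity tag.

```python
import zlib
from typing import Dict, Iterable, List

# Columns projected by the "insert into FilteredStream select ..." statement.
COLUMNS = ("guid", "evt_type", "C1", "C2", "C3")


def filter_events(raw_stream: Iterable[dict], evt_type: str) -> List[dict]:
    """Keep events of one type and project only the selected columns."""
    return [
        {c: e.get(c) for c in COLUMNS}
        for e in raw_stream
        if e.get("evt_type") == evt_type
    ]


def route_by_affinity(events: Iterable[dict], num_consumers: int) -> Dict[int, List[dict]]:
    """Send every event with the same guid to the same consumer index.

    Assumption: a simple CRC32 hash modulo the consumer count; a real
    deployment might use consistent hashing instead.
    """
    routed: Dict[int, List[dict]] = {i: [] for i in range(num_consumers)}
    for e in events:
        idx = zlib.crc32(str(e["guid"]).encode("utf-8")) % num_consumers
        routed[idx].append(e)
    return routed
```

Routing by guid keeps all events of one user on one node, which is what later stateful stages such as sessionization rely on.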
Aggregate Computation Example
// create 10-second time window context
create context MCContext start @now end pattern [timer:interval(10)];
// create aggregated stream within specified time window
context MCContext insert into AggStream select count(*) as M1, guid, evt_type from RawStream group by guid, evt_type output snapshot when terminated;
// publish aggregated stream
select * from AggStream;
Stream Aggregation Time Window
Sliding Window
Tumbling Window
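The difference between the two window types can be shown with a small batch computation. This is a Python sketch for illustration only; the function names and the `(timestamp, key)` event shape are assumptions, and a streaming engine would evaluate the windows incrementally rather than over a finished list.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, grouping key)


def tumbling_counts(events: Iterable[Event], size: int) -> Dict[int, Counter]:
    """Count events per key in fixed, non-overlapping windows.

    Each event falls into exactly one window [start, start + size).
    """
    windows: Dict[int, Counter] = {}
    for ts, key in events:
        start = int(ts // size) * size
        windows.setdefault(start, Counter())[key] += 1
    return windows


def sliding_counts(events: Iterable[Event], size: int, slide: int) -> Dict[int, Counter]:
    """Count events per key in overlapping windows.

    A new window of length `size` starts every `slide` seconds, so one
    event can contribute to several windows.
    """
    windows: Dict[int, Counter] = {}
    for ts, key in events:
        start = int(ts // slide) * slide  # latest window start at or before ts
        while start > ts - size and start >= 0:
            windows.setdefault(start, Counter())[key] += 1
            start -= slide
    return windows
```

With `size=10` and `slide=5`, an event at second 12 is counted once by the tumbling version (window starting at 10) and twice by the sliding version (windows starting at 5 and 10).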
TopN Computation Example
// create 60-second time window context
create context MCContext start @now end pattern [timer:interval(60)];
// create topN stream via sorting
context MCContext insert into TopNStream select count(*) as M1, guid, evt_type from RawStream group by guid, evt_type order by M1 desc limit 10;
// publish topN stream
select * from TopNStream;
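For comparison, the same TopN computation over the events of one window can be written in a few lines of Python. This is a sketch with a hypothetical `top_n` helper, assuming events are dictionaries carrying `guid` and `evt_type`; it is not part of Pulsar.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_n(events: Iterable[dict], n: int) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n (guid, evt_type) groups with the highest event counts.

    Equivalent in spirit to: group by guid, evt_type, order by count
    descending, limit n.
    """
    counts = Counter((e["guid"], e["evt_type"]) for e in events)
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
```

`heapq.nlargest` avoids sorting every group when only the first few are needed.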
PULSAR BEHAVIORAL DATA PIPELINE
Pulsar Behavioral Data Pipeline
[Diagram] Components: Producing Applications, Collector, Bot Detection, Sessionizer, Event Distributor, Metrics Calculator, Metrics Store, Real-Time Consumers, Real-Time Dashboard and Services, Kafka, Druid, Hadoop / Kylin, Batch Loader. Labels: Real-time Pipeline, Enriched Sessionized Events.
Sessionization: Group together events of a single user visit
Event stream: e1 e2 e3 e4 e5 e6 e7 e8 . . .
User 1: >30 min of inactivity
Session A (User 1): e1, e2, e4
Session B (User 2): e3, e5, e6
Session C (User 1): e7, e8
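The grouping rule illustrated above (a new session starts after more than 30 minutes of inactivity for the same user) can be sketched in Python as follows. The function name, the event tuple layout and the in-memory dictionaries are assumptions made for illustration; they say nothing about how the Sessionizer stores its state.

```python
from typing import Dict, Iterable, List, Tuple

SESSION_GAP_SECONDS = 30 * 60  # inactivity threshold from the slide

Event = Tuple[str, float, str]  # (user, timestamp in seconds, event id)


def sessionize(events: Iterable[Event],
               gap: float = SESSION_GAP_SECONDS) -> List[Tuple[str, List[str]]]:
    """Group time-ordered events into per-user sessions.

    A user's event joins that user's open session unless more than `gap`
    seconds have passed since the user's previous event.
    """
    sessions: List[Tuple[str, List[str]]] = []
    open_session: Dict[str, int] = {}  # user -> index into `sessions`
    last_seen: Dict[str, float] = {}   # user -> timestamp of last event
    for user, ts, event_id in events:
        if user not in open_session or ts - last_seen[user] > gap:
            sessions.append((user, []))
            open_session[user] = len(sessions) - 1
        sessions[open_session[user]][1].append(event_id)
        last_seen[user] = ts
    return sessions
```

Fed the eight events from the slide, with a gap of more than 30 minutes before e7, this yields sessions A, B and C as listed.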
Sessionization Challenges
• Session state management
  – High read/write throughput
  – State recovery when a node crashes or fails
• Session expiration
  – Full table scan is not acceptable
Sessionization Solution
• Long-lived state management (at least 30 minutes)
  – Local off-heap cache
• Instantaneous session expiration (<= 1 sec delay)
  – Doubly linked off-heap map (local access)
  – Ordered by expiration time (O(1))
• Pluggable sessionization logic
  – SQL with customized annotations
  – Counter
  – State
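Why ordering by expiration time avoids a full table scan can be seen in a small sketch. The Python class below is hypothetical and keeps its data on the heap in an `OrderedDict` (which is backed by a doubly linked list); it only illustrates the access pattern and does not model the off-heap storage described above. It assumes a fixed timeout and non-decreasing timestamps, so the most recently touched session always expires last.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple


class SessionStore:
    """Sessions kept in expiration order so expiry never scans the table."""

    def __init__(self, timeout_seconds: float = 30 * 60) -> None:
        self.timeout = timeout_seconds
        # session_id -> (expires_at, state); head of the dict expires first.
        self._sessions: "OrderedDict[str, Tuple[float, Dict[str, int]]]" = OrderedDict()

    def touch(self, session_id: str, now: float) -> None:
        """Record an event: update state and move the session to the tail."""
        _, state = self._sessions.pop(session_id, (None, {"events": 0}))
        state["events"] += 1
        self._sessions[session_id] = (now + self.timeout, state)

    def expire(self, now: float) -> List[Tuple[str, Dict[str, int]]]:
        """Pop sessions from the head until one is still live.

        Cost is proportional to the number of expired sessions only.
        """
        expired = []
        while self._sessions:
            session_id, (expires_at, state) = next(iter(self._sessions.items()))
            if expires_at > now:
                break
            self._sessions.popitem(last=False)
            expired.append((session_id, state))
        return expired
```

A timer that calls `expire` once per second would give the sub-second expiration delay mentioned above, since each call touches only the sessions that are actually due.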
Sessionizer Architecture
[Diagram] Components: Collector, Sessionizer (multiple instances), Local Off-Heap Session Cache, IMC, OMC, Timer, Remote Store Client, Remote Session Store, BotDetection, Distributor. Labeled interactions: Persist, Sync, Recovery.
Bot Detection Overview
• Detect non-human activities in near real time
• May treat bot traffic differently during analysis
• High-level bot rules
  – Self-declared bots by user agent
  – Behavior within a session or time window
• Tag events with bot flag
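The two kinds of rules listed above can be sketched as follows. This Python example is illustrative only: the user-agent markers, the per-session threshold and the field names are hypothetical values chosen for the sketch, not the rules eBay uses.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical substrings that self-declared bots put in their user agent.
BOT_AGENT_MARKERS = ("bot", "crawler", "spider")
# Hypothetical behavioral threshold: more events than this in one session.
MAX_EVENTS_PER_SESSION = 100


def is_self_declared_bot(user_agent: str) -> bool:
    """Rule 1: the client identifies itself as a bot in its user agent."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_AGENT_MARKERS)


def tag_bots(events: List[Dict], max_events: int = MAX_EVENTS_PER_SESSION) -> List[Dict]:
    """Return copies of the events with a boolean `bot` flag added.

    Rule 2 flags every event of a session whose event count exceeds
    `max_events`.
    """
    per_session = Counter(e["session_id"] for e in events)
    tagged = []
    for e in events:
        bot = (
            is_self_declared_bot(e.get("user_agent", ""))
            or per_session[e["session_id"]] > max_events
        )
        tagged.append({**e, "bot": bot})
    return tagged
```

Tagging rather than dropping events leaves it to each downstream consumer to decide how bot traffic is treated.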
Bot Detection
[Diagram] Components: Collector, Sessionizer, BotDetection, Distributor, Bot Detection Service. Labeled flows: Behavior Events and Metrics, Bot Signature, Bot-tagged Stream.
PULSAR INTEGRATION WITH OTHER SYSTEMS
Pulsar Integration with Kafka
• Kafka
  – Persistent messaging queue
  – High availability, scalability and throughput
• Pulsar leveraging Kafka
  – Supports pull and hybrid messaging models
  – Loads data from the real-time pipeline into Hadoop and other metric stores
  – Uses a schema to validate the event payload
Messaging Models
[Diagram] Three messaging models:
• Push Model (at most once delivery semantics): Producer, Netty, Consumer
• Pull Model (at least once delivery semantics): Producer, Queue (Kafka), Consumer
• Hybrid Model: Producer, Queue (Kafka), Replayer, Consumer, with Pause/Resume
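One way to read the hybrid model is: push events directly while the consumer keeps up, spill them to the queue when it does not, and let a replayer drain the queue before direct push resumes. The Python sketch below encodes that reading. It is an assumption about the behavior, not a description of Pulsar's implementation; the `HybridChannel` class is hypothetical and an in-memory deque stands in for Kafka.

```python
from collections import deque
from typing import Callable, Deque


class HybridChannel:
    """Push when possible; otherwise queue events for later replay."""

    def __init__(self, push: Callable[[str], bool]) -> None:
        self._push = push                   # returns True if the consumer accepted
        self._queue: Deque[str] = deque()   # stand-in for a Kafka topic
        self.paused = False                 # True while direct push is suspended

    def send(self, event: str) -> None:
        """Try direct push; on failure, pause pushing and queue the event."""
        if not self.paused and self._push(event):
            return
        self._queue.append(event)
        self.paused = True

    def replay(self) -> bool:
        """Replayer: redeliver queued events in order, then resume push.

        An event is removed only after the consumer accepts it, so a
        failed attempt is retried on the next call.
        """
        while self._queue:
            if not self._push(self._queue[0]):
                return False
            self._queue.popleft()
        self.paused = False
        return True
```

While paused, new events go to the back of the queue, so ordering is preserved across the outage.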
Pulsar Integration with Kylin
• Apache Kylin
  – Distributed analytics engine
  – SQL interface and multi-dimensional analysis (OLAP) on Hadoop
  – Interactive queries on billions of rows
• Pulsar leveraging Kylin
  – Build multi-dimensional OLAP cubes over long time periods
  – Aggregate/drill-down on dimensions such as browser, OS, device, geo location
  – Capture metrics such as session length, page views, event counts
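The idea of a multi-dimensional cube (pre-aggregating measures for every combination of dimensions so that drill-down becomes a lookup) can be shown at toy scale. The Python sketch below is illustrative only; `build_cube` and the sample field names are hypothetical, and it does not use or imitate Kylin's API or storage.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def build_cube(rows: List[dict],
               dimensions: Sequence[str],
               measures: Sequence[str]) -> Dict[Tuple[str, ...], Dict[tuple, Dict[str, int]]]:
    """Pre-aggregate summed measures for every subset of the dimensions.

    The result maps a dimension subset, e.g. ("browser",), to a table of
    dimension values -> summed measures. The empty subset holds the
    grand total.
    """
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            table = defaultdict(lambda: {m: 0 for m in measures})
            for row in rows:
                key = tuple(row[d] for d in dims)
                for m in measures:
                    table[key][m] += row[m]
            cube[dims] = dict(table)
    return cube
```

Drilling down from browser to browser and OS is then a matter of reading `cube[("browser",)]` followed by `cube[("browser", "os")]`, with no rescan of the session rows.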
Pulsar Integration with Druid
• Druid
  – Real-time ROLAP engine for aggregation, drill-down and slice-n-dice
• Pulsar leveraging Druid
  – Real-time analytics dashboard
  – Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds
  – Aggregate/drill-down on dimensions such as browser, OS, device, geo location
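A metric such as "visitors in the last 5 minutes" is a distinct count over a trailing window. The sketch below shows one way to maintain it in plain Python; the `RecentVisitors` class is hypothetical and stands in for what a dashboard would obtain by querying Druid. It assumes events are recorded in timestamp order.

```python
from collections import Counter, deque
from typing import Deque, Tuple


class RecentVisitors:
    """Distinct visitors seen within a trailing time window."""

    def __init__(self, window_seconds: float = 300) -> None:
        self.window = window_seconds
        self._events: Deque[Tuple[float, str]] = deque()  # (timestamp, guid)
        self._active: Counter = Counter()  # guid -> events still in window

    def record(self, ts: float, guid: str) -> None:
        """Register one event for a visitor."""
        self._events.append((ts, guid))
        self._active[guid] += 1

    def count(self, now: float) -> int:
        """Evict events older than the window and return distinct visitors."""
        cutoff = now - self.window
        while self._events and self._events[0][0] <= cutoff:
            _, guid = self._events.popleft()
            self._active[guid] -= 1
            if self._active[guid] == 0:
                del self._active[guid]
        return len(self._active)
```

A dashboard refreshing every 10 seconds would simply call `count` with the current time on each refresh.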
BEHAVIORAL DATA DRIVEN APPLICATIONS
http://www.ebay.com/trending
TRENDING: ALGORITHMS NARROW THE FOCUS
Algorithms and machine learning identify significant trends
Humans provide the context and an interesting story
Activity Timeline for Personalization
[Diagram] A user's events (search, view, watch, bid, purchase) plotted over time and grouped into three ranges: Offline (historical), Nearline, and Online (in-session).
Customer Profile
• Price
• Category
• Sale Type
• Item Condition
• Deals
Customer Intent
• Price
• Category
• Sale Type
• Item Condition
• Deals
EXAMPLE PERSONALIZED CONTENT
Personalized digest for a consumer interested in jewelry and accessories
Personalized digest for a consumer interested in auto and electronics
Behavioral Data: A/B Testing
More Information
• GitHub: http://github.com/pulsarIO
  – Repos: pipeline, framework, docker files
• Website: http://gopulsar.io
  – Technical whitepaper
  – Getting started
  – Documentation
• Google group: http://groups.google.com/d/forum/pulsar