Streaming Processing in Uber Marketplace for Kafka Summit 2016

Date post: 05-Apr-2017
Upload: danny-yuan
STREAM PROCESSING IN UBER MARKETPLACE

~68 countries / 350+ cities

Transportation as reliable as running water, everywhere, for everyone

Agenda: What’s on the menu?

• Use Cases
• Problem Space
• Overall Architecture
• Choices & Tradeoffs
• Q & A

Use Case: Realtime OLAP

There is always need for quick exploration

How many open cars in the world, NOW?

How many UberXs were driving clients in SF in the past 10 minutes by hexagons?

Driving time and other metrics over time by hexagonal area

Use Case: Complex Event Processing

There are patterns in event streams

How many drivers cancel requests more than 3 times in a row within a 10-minute window?
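
A minimal sketch of how this pattern could be detected over an ordered event stream. The event shape (timestamp, driver id, event type) and the "cancel" label are assumptions made for illustration; the slides do not show the actual query or schema.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # the 10-minute window from the question
MIN_CANCELS = 4           # "more than 3 times in a row"


def drivers_with_cancel_streaks(events):
    """Return ids of drivers with MIN_CANCELS consecutive cancels inside the window.

    `events` is an iterable of (timestamp_seconds, driver_id, event_type)
    tuples, assumed to arrive in timestamp order per driver (hypothetical schema).
    """
    streaks = defaultdict(deque)  # driver_id -> timestamps of the current run of cancels
    flagged = set()
    for ts, driver_id, event_type in events:
        streak = streaks[driver_id]
        if event_type != "cancel":
            streak.clear()  # any other event breaks the run
            continue
        streak.append(ts)
        if len(streak) > MIN_CANCELS:
            streak.popleft()  # keep only the latest MIN_CANCELS cancels
        if len(streak) == MIN_CANCELS and ts - streak[0] <= WINDOW_SECONDS:
            flagged.add(driver_id)
    return flagged
```

Keeping only the most recent cancels per driver bounds the state held for each key, which matters when the check runs continuously over a stream.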

Report riders requesting pickups 100 miles apart within a half-hour window
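
One way such a rule might be expressed, as a sketch only. The request tuple layout, the function names, and the use of the haversine great-circle distance are assumptions for illustration, not details from the talk.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def riders_with_distant_pickups(requests, min_miles=100, window_seconds=30 * 60):
    """Flag riders with two pickup requests at least min_miles apart within the window.

    `requests` is an iterable of (timestamp_seconds, rider_id, lat, lng),
    assumed to be ordered by timestamp (hypothetical schema).
    """
    recent = {}  # rider_id -> [(ts, lat, lng)] still inside the window
    flagged = set()
    for ts, rider, lat, lng in requests:
        history = [h for h in recent.get(rider, []) if ts - h[0] <= window_seconds]
        if any(haversine_miles(lat, lng, hlat, hlng) >= min_miles for _, hlat, hlng in history):
            flagged.add(rider)
        history.append((ts, lat, lng))
        recent[rider] = history
    return flagged
```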

IF this -> THEN that

● Sigma is similar - but for offline/batch applications

Complex Event Processing

Use Case: Supply Positioning

Clusters Of Supply & Demand

Predicted Health Metrics

Actual Health Metrics

Monitor Marketplace Health

Challenges

OLAP of Geo-spatial Temporal Data

Reasonably Large Scale

Near Real Time

• Indexing, Lookup, Rendering

• Symmetric Neighbors

• Convex & Compact Regions

• Equal Areas

• Equal Shape

Hexagons

Scale

Geo Space x Vehicle Types x Time x Status

Granular Geo Areas

Over 10,000 hexagons in a city

Multiple Vehicle Types

7 vehicle types

Minute-level Time Buckets

1440 minutes in a day

Many Driver States

13 driver states

Many Cities

300 cities

Granular Data

1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
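
The arithmetic behind that figure can be checked in a few lines. The dictionary keys are just labels for the dimension sizes quoted on the preceding slides.

```python
from math import prod

# Dimension sizes quoted on the slides
dimensions = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

combinations_per_day = prod(dimensions.values())
print(f"{combinations_per_day:,}")  # 393,120,000,000, i.e. about 393 billion
```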

Unknown Query Patterns

Any combination of dimensions

Variety of Aggregations

- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo

Large Data Volume

• Hundreds of thousands of events per second

• At least dozens of fields in each event

Multiple Topics: Rider States, Driver States

Let’s build a stream processing pipeline

Accurate Statistics

• E.g., can’t overcount

Pipeline Template

Event Collection

Multiple Event Types with Different Volume

Hundreds of Thousands of Events Per Second

Events Should Be Available Within a Second

Events Should Rarely Get Lost

Multiple Consumers

Natural Choice: Apache Kafka

- Low latency and high throughput

- Persistent events

- Distributes a topic by partitions

- Groups consumers by consumer groups
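
The last two points can be illustrated without a broker. This is a toy model only: Kafka's own partitioner and group assignment strategies differ in detail, and the CRC32 hash, round-robin assignment, and function names here are stand-ins chosen for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread a topic's partitions across the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Within a group each partition has exactly one owner, so events with the
# same key are always handled by the same consumer.
group = assign_partitions(6, ["consumer-a", "consumer-b"])
```

A second consumer group would get its own independent assignment over the same partitions, which is how multiple consumers can read the same events.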

Event Processing

Transformation

Event Transformation Example

(Lat, Long) -> (zipcode, hexagon, S2)
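
A rough sketch of the hexagon part of this transformation. It treats longitude and latitude as flat x/y coordinates and uses the standard axial-coordinate rounding for pointy-top hexagons; the cell size, the id format, and the function names are assumptions, and the zipcode and S2 lookups are not covered. The actual indexing scheme is not shown in the slides.

```python
import math


def _cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error so q + r + s == 0
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def hex_cell(x, y, size):
    """Axial (q, r) of the pointy-top hexagon of the given size containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _cube_round(q, r)


def transform(event, size_deg=0.005):
    """Return a copy of the event with a hexagon id attached (planar approximation)."""
    q, r = hex_cell(event["lng"], event["lat"], size_deg)
    return {**event, "hexagon": f"{q}:{r}"}
```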

Pre-aggregation
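
Pre-aggregation could look roughly like the following: events are counted per combination of dimensions and minute bucket before they reach the query layer. The field names are hypothetical, and this sketch counts events rather than distinct drivers.

```python
from collections import Counter


def minute_bucket(ts_seconds):
    """Truncate a Unix timestamp to the start of its minute."""
    return int(ts_seconds // 60) * 60


def pre_aggregate(events):
    """Count events per (city, hexagon, vehicle_type, status, minute)."""
    counts = Counter()
    for event in events:
        key = (
            event["city"],
            event["hexagon"],
            event["vehicle_type"],
            event["status"],
            minute_bucket(event["ts"]),
        )
        counts[key] += 1
    return counts
```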

Joining Multiple Streams

Sessionization
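
As an illustration, a common way to sessionize is to split each key's events wherever the gap between consecutive events exceeds a threshold. The 5-minute gap and the (timestamp, key) event shape are assumptions, not values from the talk.

```python
def sessionize(events, gap_seconds=300):
    """Group (timestamp_seconds, key) events into per-key sessions.

    A new session starts whenever a key has been inactive for more than
    gap_seconds. Returns {key: [[ts, ...], ...]}.
    """
    sessions = {}
    for ts, key in sorted(events):
        key_sessions = sessions.setdefault(key, [])
        if key_sessions and ts - key_sessions[-1][-1] <= gap_seconds:
            key_sessions[-1].append(ts)  # continue the current session
        else:
            key_sessions.append([ts])    # start a new session
    return sessions
```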

Multi-Staged Processing

Minimum Requirements

- State Management

- Checkpointing

- Automatic Resource Management

- Multi-staged processing

Apache Samza

Why Apache Samza?

- DAG on Kafka

- Excellent integration with Kafka

- Built-in checkpointing

- Built-in state management

- Excellent support from our data team

Samza Is Conceptually Simple

Slightly Expanded Version

Applications

Dashboard of Realtime Business Metrics

Ad-Hoc Queries

Visualization with Streaming

LocationUpdate where city=X

LocationUpdate where city=Y and vehicle='UberX'

[Figure labels: 100%, 100%, 100%, 10%, 5%]

LocationUpdate where city='SF'

LocationUpdate where city='LA' and vehicle

[Figure labels: 10%, 5%, 100%, 100%]

Ad-hoc Exploration

A Few Trade-Offs

Lambda vs Kappa

We Use Lambda

- Spark + HDFS/S3 for batch processing
- Yes, it is painful, but:
- We may need to go way back due to changes in business requirements
- Batch processes can run faster; they scale differently
- It was not easy to start a new stream processing instance

Processing by Event Time Is Not Always Easy

Leverage The Storage Layer

Dealing with Limitations of Samza

- No broadcasting. We have to override SystemStreamPartitionGrouper
- No dynamic topology. Can’t have an arbitrary number of nested CEP queries
- Tedious configuration and deployment of jobs. In-house code-gen and deployment solution

Thank You

