Extending the Yahoo Streaming Benchmark

Post on 08-Jan-2017


Extending the Yahoo! Streaming Benchmark

Jamie Grier
@jamiegrier
jamie@data-artisans.com

Who am I?

• Director of Applications Engineering at data Artisans
• Previously worked on streaming computation at Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for about a decade
• High-speed video, social media streaming, general frameworks for stream processing

Overview

• Yahoo! performed a benchmark comparing Apache Flink, Storm and Spark
• The benchmark never actually pushed Flink to its throughput limits but stopped at Storm's limits
• I knew Flink was capable of much more, so I repeated the benchmarks myself
• I did a follow-up blog post explaining my findings and will summarize them here

Yahoo! Benchmark

• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit current value of window aggregates to Redis every second for query
• Map ads to campaigns using Redis as well
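The pipeline described on this slide can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the benchmark's Storm or Flink code: the event field names, the "view" event type, the tumbling-window interpretation, and the in-memory dict standing in for the Redis ad-to-campaign lookup are all assumptions made for the example.

```python
from collections import Counter

WINDOW_SECONDS = 10  # window length from the benchmark description

# Hypothetical stand-in for the Redis ad -> campaign lookup.
AD_TO_CAMPAIGN = {
    "ad-1": "campaign-A",
    "ad-2": "campaign-A",
    "ad-3": "campaign-B",
}


def count_impressions(events):
    """Count 'view' events per (campaign, window start) over tumbling windows.

    Each event is a dict with the assumed keys ad_id, event_type and
    event_time (seconds).
    """
    counts = Counter()
    for event in events:
        # Filter: keep only impressions (assumed to be "view" events).
        if event["event_type"] != "view":
            continue
        # Project: keep only the fields needed downstream.
        ad_id, ts = event["ad_id"], event["event_time"]
        # Join: map the ad to its campaign.
        campaign = AD_TO_CAMPAIGN.get(ad_id)
        if campaign is None:
            continue
        # Window: assign the event to its 10 second tumbling window.
        window_start = ts - (ts % WINDOW_SECONDS)
        counts[(campaign, window_start)] += 1
    return dict(counts)


events = [
    {"ad_id": "ad-1", "event_type": "view", "event_time": 100},
    {"ad_id": "ad-2", "event_type": "view", "event_time": 104},
    {"ad_id": "ad-3", "event_type": "click", "event_time": 105},
    {"ad_id": "ad-3", "event_type": "view", "event_time": 109},
    {"ad_id": "ad-1", "event_type": "view", "event_time": 112},
]
result = count_impressions(events)
```

In the real benchmark the per-window counts are written to Redis once per second while the window is still open; the sketch only returns the final counts.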

Any questions so far?

Storm Code

Flink Code

Hardware Specs

• 10 Kafka brokers with 2 partitions each
• 10 compute nodes (Flink / Storm)
• Each machine has 1 Xeon E3-1230-V2 @ 3.30GHz CPU
• 4 cores w/ hyperthreading
• 32 GB RAM (only 8 GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and compute nodes

Logical Deployment

[Diagram: Data Generator -> Kafka -> Stream Processor (Source, Filter, Project, Join, Window, Sink) -> Redis. Redis also appears next to the Join step.]

Apache Storm Deployment

[Diagram: Flink Data Generator -> Kafka -> Apache Storm running Source, Filter, Project, Join, Window and Sink as separate operators, with a shuffle between them and Redis attached. Legend: 10 GigE link, 1 GigE link.]

Apache Flink Deployment

[Diagram, built up over several slides: Source, Filter, Project and Join are merged step by step into a single Source / Filter / Project / Join stage, followed by a shuffle to a Window / Sink stage. Flink Data Generator, Kafka and Redis are unchanged. Legend: 10 GigE link, 1 GigE link.]

Processing Guarantees

Apples and Oranges

Apache Storm                      Apache Flink
At least once semantics           Exactly once semantics
Double counting after failures    No double counting
Lost state after failures         No state loss
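The double-counting row can be illustrated with a toy model. This is not how either system is implemented; it is a minimal sketch, under assumed names and parameters, of what happens when input is replayed after a failure with and without rolling the counter state back to the same point as the input offset.

```python
def run_with_replay(records, fail_at, checkpoint_every, restore_state):
    """Count records, crash once at index fail_at, then replay.

    A checkpoint stores (offset, count) every checkpoint_every records.
    restore_state=True  rolls the count back together with the offset.
    restore_state=False rolls back only the offset, so records processed
    since the checkpoint are counted a second time on replay.
    All parameter names are hypothetical and exist only for this sketch.
    """
    count = 0
    offset = 0
    checkpoint = (0, 0)
    failed = False
    while offset < len(records):
        if offset % checkpoint_every == 0:
            checkpoint = (offset, count)
        if offset == fail_at and not failed:
            # Simulated failure: rewind the input to the last checkpoint.
            failed = True
            offset = checkpoint[0]
            if restore_state:
                count = checkpoint[1]
            continue
        count += 1
        offset += 1
    return count


records = list(range(10))
consistent_restore = run_with_replay(records, fail_at=7, checkpoint_every=5, restore_state=True)
replay_only = run_with_replay(records, fail_at=7, checkpoint_every=5, restore_state=False)
```

With ten input records, the run that restores state alongside the offset ends with a count of 10, while the replay-only run ends with 12 because the two records between the checkpoint and the failure are counted twice.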

Benchmark

[Bar chart: baseline throughput in msgs/sec for Storm and Flink; axis from 0 to 3,750,000.]

Bottleneck Analysis: Apache Storm

[Diagram: the Apache Storm deployment, with the bottleneck labeled CPU.]

Bottleneck Analysis: Apache Flink

[Diagram: the Apache Flink deployment, with the bottleneck labeled Network.]

Eliminate the Bottleneck

[Diagram, built up over several slides: Kafka is removed from the picture and the Data Generator is moved next to the Apache Flink stages (Source / Filter / Project / Join, shuffle, Window / Sink), with Redis still attached. Legend: 10 GigE link, 1 GigE link.]

Apache Flink Deployment: Round 2

Benchmark: Round 2

[Bar chart: throughput in msgs/sec for Storm, Flink, and Flink (10 GigE end-to-end); axis from 0 to 16,000,000.]

Results

• Apache Flink achieved 15 million messages / sec on the Yahoo! benchmark
• Much stronger processing guarantees: exactly once
• 80x higher throughput than what was reported in the original Yahoo! benchmark on similar hardware
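As a rough sanity check, the two figures on this slide imply a baseline, and the node count from the hardware slide implies a per-node rate. The sketch below only does that arithmetic; it assumes the 15 million figure was measured on the same 10 compute nodes and that "80x" is a rounded ratio, so the outputs are approximate.

```python
flink_msgs_per_sec = 15_000_000  # from the Results slide
speedup = 80                     # "80x higher", assumed to be a rounded ratio
compute_nodes = 10               # from the Hardware Specs slide (assumed same cluster)

# Throughput the original benchmark would have reported, if the ratio were exact.
implied_original_msgs_per_sec = flink_msgs_per_sec // speedup

# Average throughput each compute node would have to sustain.
per_node_msgs_per_sec = flink_msgs_per_sec // compute_nodes

print(implied_original_msgs_per_sec)  # 187500
print(per_node_msgs_per_sec)          # 1500000
```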

Questions?

Storm Compatibility

• Lots of companies already have applications written using the Storm API
• Flink provides a Storm compatibility layer
• Run your Storm jobs on Flink with a one-line code change
• Flink also allows you to reuse your existing Storm spout and bolt code from a Flink job
• Give it a try!

Thanks!