Extending the Yahoo Streaming Benchmark for Apache Apex
San Jose Apache Apex Meetup, May 4th 2016
Sandesh
Background
• Yahoo created a benchmark to compare stream processing systems and used it to compare Storm, Flink and Spark Streaming [1]
• dataArtisans extended the benchmark by comparing Flink and Storm in different scenarios [2]
• No stream processing benchmark comparison is complete without Apache Apex.
Yahoo Streaming Benchmark
Simple advertisement application: count how many times an ad campaign has been seen in a window.
• Read ads from Kafka
• Deserialize JSON string
• Filter unnecessary ads
• Projection of Fields ( remove non-essential fields )
• Join ad id with campaign id from Redis
• Windowed count per campaign and output to Redis
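The steps above can be sketched in plain Python. This is only an illustration of the data flow, not the benchmark code: a list stands in for the Kafka topic, the hypothetical `AD_TO_CAMPAIGN` dict stands in for the Redis lookup, and the 10-second window and the "view" filter are assumptions not stated on the slide.

```python
import json
from collections import Counter

# Hypothetical stand-ins for the external systems: a dict replaces the
# Redis ad-to-campaign mapping, and a list of strings replaces Kafka.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-A", "ad-3": "campaign-B"}
WINDOW_MS = 10_000  # assumed window length, not taken from the slides


def count_views(raw_messages, wanted_event="view"):
    """Return {(campaign_id, window_start_ms): count} for the wanted events."""
    counts = Counter()
    for raw in raw_messages:
        event = json.loads(raw)                   # deserialize JSON string
        if event["event_type"] != wanted_event:   # filter unnecessary ads
            continue
        ad_id = event["ad_id"]                    # projection: keep only the
        event_time = int(event["event_time"])     # two fields needed downstream
        campaign = AD_TO_CAMPAIGN.get(ad_id)      # join ad id with campaign id
        if campaign is None:
            continue
        window_start = event_time - event_time % WINDOW_MS
        counts[(campaign, window_start)] += 1     # windowed count per campaign
    return dict(counts)


messages = [
    json.dumps({"ad_id": "ad-1", "event_type": "view", "event_time": "1000"}),
    json.dumps({"ad_id": "ad-2", "event_type": "view", "event_time": "9999"}),
    json.dumps({"ad_id": "ad-3", "event_type": "click", "event_time": "2000"}),
    json.dumps({"ad_id": "ad-3", "event_type": "view", "event_time": "12000"}),
]
result = count_views(messages)
print(result)
```

In a streaming engine each of these steps would be a separate operator connected by streams; here they are collapsed into one loop to keep the sketch short.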
Setup
• Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
• 10GigE between compute nodes
• 4 Kafka brokers (2 partitions each, 1 replica)
• Kafka version: 0.8.2
• Apex (3.4-SNAPSHOT and 3.3) and Flink (1.0.2)
• YARN container size: 16GB
• 1 ZooKeeper
• Message size: 218 bytes
• Sample message: {"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c","page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda","ad_id":"600589859","ad_type":"banner78","event_type":"purchase","event_time":"1462374087774","ip_address":"1.2.3.4"}
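As a small illustration of what the deserialize step has to handle, the sample message above can be parsed like this. Every field arrives as a string, so the timestamp must be converted before windowing. Treating `event_time` as milliseconds since the Unix epoch and using a 10-second window are assumptions made for this sketch.

```python
import json
from datetime import datetime, timezone

# The sample message from the setup slide, joined into one JSON string.
SAMPLE = (
    '{"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c",'
    '"page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda",'
    '"ad_id":"600589859","ad_type":"banner78","event_type":"purchase",'
    '"event_time":"1462374087774","ip_address":"1.2.3.4"}'
)

event = json.loads(SAMPLE)

# Assumption: event_time is milliseconds since the Unix epoch.
event_ms = int(event["event_time"])
event_dt = datetime.fromtimestamp(event_ms / 1000, tz=timezone.utc)

WINDOW_MS = 10_000  # assumed window length, for illustration only
window_start_ms = event_ms - event_ms % WINDOW_MS

print(sorted(event))                  # field names carried by each message
print(event_dt.date(), window_start_ms)
```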
Quick Primer on Locality
• CONTAINER_LOCAL
  ■ Deployed in the same process, different threads
  ■ No serialization
  ■ Queue between the operators
• THREAD_LOCAL
  ■ Same thread
  ■ No serialization
  ■ Use it only when operators do light work
Note: [New feature] Anti Affinity is not covered here.
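The difference between the two modes can be illustrated with a conceptual Python sketch. It does not use the Apex API; the operators `double` and `add_one` and both runner functions are hypothetical. The thread-local variant chains the operators with plain function calls in one thread, while the container-local variant gives each operator its own thread and passes object references through an in-process queue.

```python
import queue
import threading


def double(x):
    """Upstream operator (hypothetical)."""
    return x * 2


def add_one(x):
    """Downstream operator (hypothetical)."""
    return x + 1


def run_thread_local(items):
    """Same thread: the downstream operator is invoked by a direct call."""
    return [add_one(double(x)) for x in items]


def run_container_local(items):
    """Same process, different threads: a queue sits between the operators."""
    stream = queue.Queue()
    results = []
    done = object()  # sentinel marking the end of the stream

    def upstream():
        for x in items:
            stream.put(double(x))  # object reference handed over, no serialization
        stream.put(done)

    def downstream():
        while True:
            item = stream.get()
            if item is done:
                break
            results.append(add_one(item))

    threads = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_thread_local([1, 2, 3]))
print(run_container_local([1, 2, 3]))
```

Both variants produce the same output. The thread-local form avoids the queue hand-off but makes the upstream operator wait for the downstream one, which is consistent with the advice to use it only when operators do light work.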
Benchmarking Against Previous Releases
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
Part of Release Certification
Application - With Kafka
https://github.com/sandeshh/streaming-benchmarks
Application - With Generator
[Diagram operators: Kafka, Kafka Input, Deserialize, Filter, Filter Fields, Redis Join, Redis Output, Generator]
Application - With Generator
https://github.com/sandeshh/streaming-benchmarks
Setup: Single Partition
State of the Art & Streaming
[Diagram operators: Generator, Filter, Filter Fields, Redis Join, Redis Output]
What’s our recommendation to query the state?
In-memory key-value store in the operators?
Application - State Store & Query
[Diagram operators: Generator, Filter, Filter Fields, Redis Join, Dimensional Computation, Store (HDHT), Query, Result]
1. Durable state (HDHT is a key-value store native to Hadoop) [4]
2. Single system, scales with your application
3. Easy integration with external consoles [7]
4. Low operability cost
5. Complex dimensional computation [5][6]
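The idea of keeping queryable aggregates next to the computation can be sketched as follows. This is a minimal in-memory illustration and not HDHT: the `DimensionStore` class, its method names, and the choice of dimensions are all hypothetical, and nothing here is durable. It keeps one count per time bucket for each campaign, plus one per campaign and ad type, and answers range queries over the buckets.

```python
from collections import defaultdict


class DimensionStore:
    """Hypothetical in-memory aggregate store keyed by dimensions and time bucket."""

    def __init__(self, bucket_ms=60_000):
        self.bucket_ms = bucket_ms
        self._counts = defaultdict(int)

    def add(self, campaign_id, ad_type, event_time_ms):
        bucket = event_time_ms - event_time_ms % self.bucket_ms
        # Maintain one aggregate per dimension combination:
        # campaign only (ad_type=None) and campaign together with ad_type.
        self._counts[(campaign_id, None, bucket)] += 1
        self._counts[(campaign_id, ad_type, bucket)] += 1

    def query(self, campaign_id, start_ms, end_ms, ad_type=None):
        """Sum counts for buckets whose start lies in [start_ms, end_ms)."""
        return sum(
            count
            for (campaign, kind, bucket), count in self._counts.items()
            if campaign == campaign_id
            and kind == ad_type
            and start_ms <= bucket < end_ms
        )


store = DimensionStore(bucket_ms=1000)
store.add("c1", "banner", 100)
store.add("c1", "banner", 1500)
store.add("c1", "modal", 1600)
store.add("c2", "banner", 200)

print(store.query("c1", 0, 2000))
print(store.query("c1", 0, 2000, ad_type="banner"))
```

Pre-aggregating per dimension combination trades memory for query speed: a query only sums a handful of bucket counts instead of rescanning raw events.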
References
1. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
2. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
3. https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
4. https://www.datatorrent.com/blog/data-store-for-scalable-stream-processing/
5. https://www.datatorrent.com/blog/blog-dimensions-computation-aggregate-navigator-part-1-intro/
6. https://www.datatorrent.com/blog/dimensions-computation-aggregate-navigator-part-2-implementation/
7. http://docs.datatorrent.com/app_data_framework/
© 2016 DataTorrent
Resources
• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders