Date post: | 16-Apr-2017 |
Category: | Software |
Upload: | chang-ming-tsai |
StatServer-Samza: Near Real-time Analytics
Ranking infrastructure, Tomy Tsai
Special thanks to: David Stein, Clement Fung
Why We Need Real-Time Counting
• Impression Discounting
Other Use Cases
• News feed
– Show trending news
• Impression boosting
• Most popular Ads
• Real-time CTR = (# clicks / # views)
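The CTR ratio named in this slide can be written as a small helper. This is only an illustrative sketch: the function name `ctr` and the convention of returning 0.0 when nothing has been viewed are assumptions, not something the slides specify.

```python
def ctr(clicks: int, views: int) -> float:
    """Click-through rate = clicks / views.

    Returns 0.0 when there are no views (an assumed convention).
    """
    if views <= 0:
        return 0.0
    return clicks / views
```

In a real-time setting, `clicks` and `views` would be the counts observed over the same recent time window.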
StatServer @ LinkedIn
• Near real-time counting for in-session relevance applications
• Widely used in company products
• Recently, we re-designed and implemented StatServer in Samza
– Why?
Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results
Query form
• For each platform, how many times has the job titled “Engineer@LinkedIn” been viewed in the last 5 minutes?
• SELECT platform, count(*) FROM job_log WHERE title = 'Engineer@LinkedIn' AND action = 'view' AND time > now - 5 minutes GROUP BY platform
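To make the query semantics concrete, here is a minimal sketch of the same filter and group-by evaluated over an in-memory event list. The event field names mirror the columns in the query; the sample events, the function name `count_views_by_platform`, and the use of seconds for timestamps are assumptions made for illustration.

```python
from collections import Counter

def count_views_by_platform(job_log, title, now, window_seconds=300):
    """Count 'view' events for one job title in the last window, grouped by platform."""
    counts = Counter()
    for event in job_log:
        if (event["title"] == title
                and event["action"] == "view"
                and event["time"] > now - window_seconds):
            counts[event["platform"]] += 1
    return dict(counts)

# Hypothetical sample data; timestamps are in seconds.
job_log = [
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "web", "time": 1000},
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "ios", "time": 1100},
    {"title": "Engineer@LinkedIn", "action": "click", "platform": "web", "time": 1150},
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "web", "time": 600},
]
result = count_views_by_platform(job_log, "Engineer@LinkedIn", now=1200)
```

Scanning the raw log like this costs time proportional to the log size on every query, which is why serving it at high concurrency calls for a different approach.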
Pre-materialized query
• Store counts in a Key-Value store
– Key: job_title, action, time
– Value: { platform: counts }
• Answer query by store lookup
– Fast
– High concurrency
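A minimal sketch of the pre-materialized layout, using a plain in-memory dictionary as a stand-in for the key-value store. The class `CountStore`, the one-minute bucket size, and the method names are hypothetical; the slides do not state the time granularity, and this is not Voldemort's API.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket size; the slides do not state one

class CountStore:
    """In-memory stand-in for the key-value store (hypothetical)."""

    def __init__(self):
        # key: (job_title, action, time_bucket) -> value: {platform: count}
        self._data = defaultdict(lambda: defaultdict(int))

    def record(self, job_title, action, platform, timestamp):
        bucket = timestamp // BUCKET_SECONDS
        self._data[(job_title, action, bucket)][platform] += 1

    def query(self, job_title, action, now, window_seconds=300):
        """Answer by key lookups only: one lookup per time bucket in the window."""
        totals = defaultdict(int)
        first = (now - window_seconds) // BUCKET_SECONDS + 1
        last = now // BUCKET_SECONDS
        for bucket in range(first, last + 1):
            stored = self._data.get((job_title, action, bucket), {})
            for platform, n in stored.items():
                totals[platform] += n
        return dict(totals)
```

With this layout a five-minute query touches a fixed number of keys regardless of how many events were logged. The window is approximated at bucket granularity, which is a simplification of this sketch.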
Query form variation
• Form 1:
– Filter-by: job_title, action, time
– Group-by: { platform: counts }
• Form 2:
– Filter-by: job_title, platform, time
– Group-by: { action: counts }
– Can reuse the extracted feature values from Form 1
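The reuse point can be illustrated with a sketch that builds both forms in a single pass over the same extracted feature records. The record field names follow the slide; the function name `materialize` and the in-memory dictionaries are assumptions for illustration.

```python
from collections import defaultdict

def materialize(features):
    """Build both query forms from one pass over extracted feature records."""
    # Form 1: (job_title, action, time) -> {platform: count}
    form1 = defaultdict(lambda: defaultdict(int))
    # Form 2: (job_title, platform, time) -> {action: count}
    form2 = defaultdict(lambda: defaultdict(int))
    for f in features:
        form1[(f["job_title"], f["action"], f["time"])][f["platform"]] += 1
        form2[(f["job_title"], f["platform"], f["time"])][f["action"]] += 1
    return form1, form2
```

Feature extraction happens once; only the choice of which fields form the key and which field is grouped in the value differs between the two forms.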
Infrastructure
[Diagram: StatServer consists of a Writer and a Reader. The Writer consumes Kafka tracking data and stores materialized count data in Voldemort; the Reader serves client applications.]
Writer Workflow
[Diagram: StatServer-Writer. Kafka data (Job Click, Job View) → Feature Extractors (one per event type) → batch queue → Batcher (Count Aggregator) → writer queue → multiple Writers → materialized count data in Voldemort]
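The role of the Batcher (Count Aggregator) stage can be sketched as follows: it collapses many single-event increments into one delta per key before handing them to the writers. The class below is purely illustrative; the flush trigger (a fixed number of buffered events) and the `sink` callback standing in for the writer queue are assumptions, since the slides do not describe how batches are cut.

```python
from collections import Counter

class Batcher:
    """Aggregates per-key increments and flushes them as one batch (illustrative)."""

    def __init__(self, flush_size, sink):
        self.flush_size = flush_size  # assumed trigger: number of buffered events
        self.sink = sink              # callable receiving {key: delta}; stands in for the writer queue
        self.pending = Counter()
        self.buffered = 0

    def add(self, key):
        self.pending[key] += 1
        self.buffered += 1
        if self.buffered >= self.flush_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink(dict(self.pending))
        self.pending = Counter()
        self.buffered = 0

# Usage: three events become a single batch with two keys.
batches = []
batcher = Batcher(flush_size=3, sink=batches.append)
for key in ["job_a", "job_a", "job_b"]:
    batcher.add(key)
```

Aggregating before writing means the store sees one update per key per batch rather than one per event.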
Query form variation
[Diagram: the same StatServer-Writer pipeline (Feature Extractors → batch queue → Batcher (Count Aggregator) → writer queue → Writers → Voldemort), producing two materialized forms]
• {job_title, action, time} -> {platform: counts}
• {job_title, platform, time} -> {action: counts}
StatServer issues
• Multi-tenant problem
– When one application used too many system resources, all other applications were impacted
– Need a clean way to scale out applications
Partition vs Concurrency
• All writers may write to the same key
– Some writers may be bounced and need to retry
– Cannot cache the previous value locally
• What if we re-partition the data before writing?
– We can do concurrent writes
– We can cache the value on the local machine
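The re-partitioning idea can be sketched like this: if every key is routed to exactly one writer, that writer never competes with others for the key and can trust its own cached value. The names `partition_for` and `PartitionWriter`, the CRC32 hash, and the dictionary standing in for the remote store are all assumptions for illustration, not the StatServer implementation.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

class PartitionWriter:
    """Sole writer for one partition, so its local cache cannot go stale (sketch)."""

    def __init__(self, store):
        self.store = store  # dict standing in for the remote key-value store
        self.cache = {}     # local copy of the last value written per key

    def increment(self, key, delta):
        if key not in self.cache:
            # One remote read the first time this writer sees the key.
            self.cache[key] = self.store.get(key, 0)
        self.cache[key] += delta
        # Plain write: no other writer touches this key, so no conflict or retry.
        self.store[key] = self.cache[key]

# Usage: route each update to the writer that owns the key's partition.
store = {}
writers = [PartitionWriter(store) for _ in range(4)]
for key in ["job_a", "job_b", "job_a", "job_a"]:
    writers[partition_for(key, 4)].increment(key, 1)
```

Because ownership is exclusive, writers for different partitions can run concurrently without coordinating with each other.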
Why Samza?
• Resource isolation
– Avoid multi-tenant problem
• Easy and flexible re-partition
– Improve concurrency
• Other advantages
– Local state support, scalability control
Original Architecture
[Diagram: StatServer-Writer. Kafka data (Job Click, Job View) → Feature Extractors → batch queue → Batcher (Count Aggregator) → writer queue → multiple Writers → materialized count data in Voldemort]
Use Kafka to Replace Queue
[Diagram: StatServer-Samza-Writer. Kafka data (Job Click, Job View) → Feature Extractors → Kafka (Extracted Features), replacing the batch queue → Batcher (Count Aggregator) → Kafka (Materialized Query), replacing the writer queue → multiple Writers → materialized count data in Voldemort]
Components as Samza Jobs
[Diagram: StatServer-Samza-Writer. Job Click and Job View Feature Extractors → Kafka (Extracted Features) → Batcher (Count Aggregator) → Kafka (Materialized Query) → Writer → materialized count data in Voldemort]
Another Query Form
[Diagram: StatServer-Samza-Writer. Kafka data (Job Click, Job View) → Feature Extractors → a separate Batcher and Writer for Query Form 1 and for Query Form 2 → materialized count data in Voldemort]
• Why not combine batcher and writer?
Current Status
• StatServer-Samza is running successfully
– The computed results have been validated against the original StatServer
• The writer is twice as efficient, because of
– Improved concurrency
– Effective local cache
Deployment Issues
• System resource tuning is not straightforward
– A restart causes a burst in system resource usage
• The number of Kafka topic partitions also needs fine-tuning
– Too many partitions cause system resource overhead
– Too few partitions cause the system to lag behind
More Lessons Learned
• Windowable task: manual commit
• RocksDB (local store) doesn’t support “delete all keys”
– TTL is good for managing caches
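The TTL point can be illustrated with a small cache in which entries simply expire, so no bulk "delete all keys" operation is ever needed. This is a generic sketch of the idea only: the class `TTLCache` and its injectable `clock` parameter are hypothetical, and it does not reproduce the RocksDB or Samza store APIs.

```python
import time

class TTLCache:
    """Dict-backed cache whose entries expire after ttl_seconds (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._items = {}    # key -> (value, expiry time)

    def put(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # lazily evict the expired entry
            return default
        return value
```

Stale counts age out on their own, which fits windowed counters whose old buckets stop being useful after the query window has passed.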
Conclusion
• Samza is a clean, powerful framework for stream processing
– StatServer-Samza is more modularized than before
• Samza/Kafka config can greatly impact performance
Thanks to
StatServer Developers
David Stein
Lance Wall
Tomy Tsai
Clement Fung
Joel Young
Samza Team
Kafka Team
Voldemort Team