StatServer-Samza: Near Real-Time Analytics

Ranking infrastructure, Tomy Tsai

Special thanks to: David Stein, Clement Fung

Why do we need real-time counting?

[Figure: impression discounting, illustrated with click events]

Other Use Cases
• News feed
  – Show trending news
• Impression boosting
• Most popular ads
• Real-time CTR = (# clicks / # views)

StatServer @ LinkedIn
• Near real-time counting for in-session relevance applications
• Widely used in company products
• Recently, we redesigned and reimplemented StatServer in Samza
  – Why?

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Redesigning the Writer in Samza
• Results

Query form
• For each platform, how many times has the job titled "Engineer@LinkedIn" been viewed in the last 5 minutes?
• SELECT platform, count(*)
  FROM job_log
  WHERE title = "Engineer@LinkedIn" AND action = "view" AND time > now - 5 minutes
  GROUP BY platform

Pre-materialized query
• Store counts in a key-value store
  – Key: job_title, action, time
  – Value: { platform: counts }
• Answer the query with a store lookup
  – Fast
  – High concurrency
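As a minimal sketch of the idea, the snippet below keeps counts in an in-memory dictionary that stands in for the key-value store and answers the "last 5 minutes" query with key lookups only. The 60-second bucket width, the function names, and the use of Unix timestamps are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width; the slides do not state one

# Stand-in for the key-value store: (job_title, action, time) -> {platform: count}
store = defaultdict(lambda: defaultdict(int))

def bucket(ts):
    """Round a Unix timestamp down to the start of its time bucket."""
    return ts - ts % BUCKET_SECONDS

def record(job_title, action, platform, ts):
    """Writer side: increment the pre-materialized count for one event."""
    store[(job_title, action, bucket(ts))][platform] += 1

def query(job_title, action, now, window_seconds=300):
    """Reader side: answer the GROUP BY platform query with key lookups only.

    The window is approximated at bucket granularity, so it can include
    events slightly older than window_seconds.
    """
    totals = defaultdict(int)
    first = bucket(now - window_seconds)
    for b in range(first, bucket(now) + BUCKET_SECONDS, BUCKET_SECONDS):
        # .get() avoids creating empty entries for buckets with no events
        for platform, n in store.get((job_title, action, b), {}).items():
            totals[platform] += n
    return dict(totals)
```

With these assumptions, a five-minute query touches at most six keys no matter how many raw events were logged, which is what makes the lookup fast.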

Query form variation
• Form 1
  – Filter-by: job_title, action, time
  – Group-by: { platform: counts }
• Form 2
  – Filter-by: job_title, platform, time
  – Group-by: { action: counts }
  – Can reuse the extracted feature values from Form 1
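The reuse of extracted features across the two forms can be sketched as follows. Features are extracted once per event, and each form only differs in which fields make up the key and which field is grouped. The event field names (`title`, `ts`), the 60-second bucket, and the `FORMS` table are hypothetical and serve only to illustrate the point.

```python
from collections import defaultdict

def extract_features(event):
    """Feature extraction runs once per event; both forms reuse its output."""
    return {
        "job_title": event["title"],
        "action": event["action"],
        "platform": event["platform"],
        "time": event["ts"] - event["ts"] % 60,  # assumed 60-second bucket
    }

# Each form is (filter-by fields that form the key, group-by field).
FORMS = {
    "form1": (("job_title", "action", "time"), "platform"),
    "form2": (("job_title", "platform", "time"), "action"),
}

def new_tables():
    """One count table per form: key -> {group value: count}."""
    return {name: defaultdict(lambda: defaultdict(int)) for name in FORMS}

def materialize(features, tables):
    """Update every form's table from the same extracted features."""
    for name, (filter_by, group_by) in FORMS.items():
        key = tuple(features[f] for f in filter_by)
        tables[name][key][features[group_by]] += 1
```

Adding a third form under this sketch means adding one entry to `FORMS`; the extraction step is untouched.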


Infrastructure

[Diagram: StatServer consists of a Writer and a Reader. Kafka tracking data feeds the Writer, which stores materialized count data in Voldemort. The Reader serves client applications from that store.]

Writer Workflow

[Diagram: StatServer-Writer. Kafka data (Job Click, Job View) → Job Click and Job View Feature Extractors → batch queue → Batcher (Count Aggregator) → writer queue → multiple Writers → materialized count data in Voldemort.]
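The role of the Batcher (Count Aggregator) stage can be sketched as below: it collapses many extracted-feature events into a single count delta per key before handing them to the writers. The class name, the event-count flush threshold, and the batch layout are assumptions for illustration; the slides do not describe how the original batcher decides when to flush.

```python
from collections import Counter

class Batcher:
    """Collapses many extracted-feature events into one delta per key."""

    def __init__(self, max_events=1000):  # assumed flush threshold
        self.max_events = max_events
        self.pending = Counter()  # (key, group value) -> count delta
        self.seen = 0

    def add(self, key, group_value):
        """Buffer one event; return a flushed batch once the threshold is hit."""
        self.pending[(key, group_value)] += 1
        self.seen += 1
        if self.seen >= self.max_events:
            return self.flush()
        return None

    def flush(self):
        """Hand the aggregated deltas to the writer stage and reset."""
        batch, self.pending, self.seen = dict(self.pending), Counter(), 0
        return batch
```

Aggregating first means the writers issue one store update per distinct key in a batch rather than one per raw event.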

Query form variation

[Diagram: StatServer-Writer. Job Click and Job View Feature Extractors → batch queue → Batcher (Count Aggregator) → writer queue → Voldemort. The pipeline materializes both query forms:]
• {job_title, platform, time} -> {action: counts}
• {job_title, action, time} -> {platform: counts}


StatServer issues
• Multi-tenant problem
  – When one application used too many system resources, all other applications were impacted
  – Need a clean way to scale out applications

Partition vs Concurrency
• All writers may write to the same key
  – Some writers may be bounced and need to retry
  – Cannot cache the previous value locally
• What if we re-partition the data before writing?
  – We can write concurrently
  – We can cache the value on the local machine
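A minimal sketch of why re-partitioning helps: if every key is routed to exactly one writer, writers never contend on a key, and each writer's locally cached value is always current. The CRC32-based partitioner, the `PartitionWriter` class, and the dictionary standing in for the remote store are all assumptions for illustration, not the actual StatServer or Samza code.

```python
import zlib

def partition_for(key, num_partitions):
    """Stable hash so the same key always lands on the same partition."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions

class PartitionWriter:
    """Sole owner of the keys in one partition, so its cache is never stale."""

    def __init__(self, store):
        self.store = store  # stand-in for the remote key-value store
        self.cache = {}     # key -> last value this writer wrote

    def apply(self, key, delta):
        if key not in self.cache:
            # Only the first update for a key needs a remote read.
            self.cache[key] = self.store.get(key, 0)
        self.cache[key] += delta
        # No other writer touches this key, so no conflict or retry is expected.
        self.store[key] = self.cache[key]
        return self.cache[key]

def route(updates, writers):
    """Send each (key, delta) update to the writer that owns the key."""
    for key, delta in updates:
        writers[partition_for(key, len(writers))].apply(key, delta)
```

Without the routing step, two writers could read the same old value and overwrite each other, which is the bounce-and-retry situation described above.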

Why Samza?
• Resource isolation
  – Avoids the multi-tenant problem
• Easy and flexible re-partitioning
  – Improves concurrency
• Other advantages
  – Local state support, scalability control


Original Architecture

[Diagram: StatServer-Writer. Job Click and Job View Feature Extractors → batch queue → Batcher (Count Aggregator) → writer queue → Voldemort.]

Use Kafka to Replace Queue

[Diagram: StatServer-Samza-Writer. Job Click and Job View Feature Extractors → Kafka (Extracted Features) → Batcher (Count Aggregator) → Kafka (Materialized Query) → Voldemort. The two Kafka topics take the place of the batch queue and the writer queue.]

Components as Samza Jobs

[Diagram: StatServer-Samza-Writer. Job Click and Job View Feature Extractors → Kafka (Extracted Features) → Batcher (Count Aggregator) → Kafka (Materialized Query) → Writer → Voldemort.]

Another Query Form

[Diagram: StatServer-Samza-Writer. Kafka data (Job Click, Job View) → Job Click and Job View Feature Extractors → Batcher for Query Form 1 → Writer for Query Form 1, and Batcher for Query Form 2 → Writer for Query Form 2 → materialized count data in Voldemort.]

Why not combine batcher and writer?


Current Status
• StatServer-Samza is running successfully
  – The computed results have been validated against the original StatServer
• The writer is twice as efficient, because of
  – Improved concurrency
  – Effective local cache

Deployment Issues
• System resource tuning is not straightforward
  – A restart causes a burst of system resource usage
• The number of Kafka topic partitions also needs fine-tuning
  – Too many partitions cause system resource overhead
  – Too few partitions cause the system to lag behind

More Lessons Learned
• Windowable task: manual commit
• RocksDB (local store) doesn't support "delete all keys"
  – TTL is good for managing caches
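To illustrate why a TTL is a convenient substitute for "delete all keys", here is a minimal sketch of a time-to-live cache in which entries simply stop being visible once they are old enough. It is a plain Python stand-in, not RocksDB or Samza code; the class name, the lazy expiry on read, and the injectable clock are assumptions made for illustration.

```python
import time

class TTLCache:
    """Cache whose entries expire on their own instead of being bulk-deleted."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested deterministically
        self.data = {}      # key -> (value, expiry time)

    def put(self, key, value):
        self.data[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self.data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.data[key]  # expire lazily on read
            return default
        return value
```

Because counts are keyed by time bucket, old buckets are never read again, so letting them age out is enough to keep the cache bounded.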

Conclusion
• Samza is a clean, powerful framework for stream processing
  – StatServer-Samza is more modular than before
• Samza/Kafka configuration can greatly impact performance

Thanks to

StatServer Developers

David Stein

Lance Wall

Tomy Tsai

Clement Fung

Joel Young

Samza Team

Kafka Team

Voldemort Team

Date post:	16-Apr-2017
Category:	Software
Upload:	chang-ming-tsai