Date post: | 13-Jan-2017 |
Category: |
Technology |
Upload: | ed-yakabosky |
View: | 163 times |
Download: | 0 times |
Apache SamzaPast, Present and Future
Kartik Paramasivam Director of Engineering, Streams Infra@ LinkedIn
Agenda
1. Stream Processing 2. State of the Union3. Apache Samza : Key Differentiators 4. Apache Samza Futures
Stream Processing: Processing events as soon as they happen..
● Stateless Processing
■ Transformation etc.
■ Lookup adjunct data (lookup databases/call services )
■ Producing results for every event
● Stateful Processing
■ Triggering/Producing results periodically (time-windows)
● Maintain intermediate state
■ E.g. Joining across multiple streams of events.
● Common Issues
■ Scale !! Scale !! Scale !!
■ Reliability !!
■ Everything else (upgrades, debugging, diagnostics, security, ……)
Stream Processing: State of the Union
MillwheelStorm
Heron
Spark Streaming
S4
Dempsey
Samza
Flink
Beam
DataflowAzure Stream Analytics
AWS Kinesis AnalyticsGearPumpKafka Streams
Orleans
Not meant to be an accurate timeline..
Yes It is CROWDED !!
Apache Samza
● Top level Apache project since Dec 2014● 5 big Releases (0.7, 0.8, 0.9, 0.10, 0.11)● 62 Contributors● 14 Committers● Companies using : LinkedIn, Uber, MetaMarkets, Netflix,
Intuit, TripAdvisor, MobileAware, Optimizely …. https://cwiki.apache.org/confluence/display/SAMZA/Powered+By
● Applications at LinkedIn : from ~20 to ~200 in 2 years.
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library
Performance : Accessing Adjunct Data
Samza ProcessorDatabase
Remote-read
Samza ProcessorCapture changes
Databus, Brooklin
Rocks-DB
Local readDatabase
Local data access
Remote data access
Input stream
Input stream
Performance : Maintaining Temporary State
Samza Processor RemoteDatabase
Read-Write
Samza ProcessorBackup changes
Kafka Change Log(Log compacted)Rocks-
DB
Local read/write
Local data access
Remote data access
Input stream
Input stream
In Memory Store
Performance : Let us talk numbers !
● 100x Difference between using Local State vs Remote No-Sql store
● Local State details:
○ 1.1 Million TPS on a single processing machine (SSD)
○ Used a 3 node Kafka cluster for storing the durable changelog
● Remote State details:
○ 8500 TPS when the Samza job was changed to accessing a remote No-Sql store
○ No-Sql Store was also on a 3 node (ssd) cluster
Remote State : Asynchronous Event Processing
Event Loop(Single thread)
ProcessAsync
Remote DB /Services
Asynchronous I/O calls, using Java Nio, Netty...
Responses sent to main thread via callback
Event loop is woken up to process next message
Task.max.concurrency >1 to enable pipelining
Available with Samza 0.11
Remote State: Synchronous Processing on Multiple Threads
Event Loop(Single thread)
Schedule Process()
Remote DB/ Services Built-In
Thread pool
Blocking I/O calls
Event loop is woken up by the worker thread
job.container.thread.pool.size = N
Available with Samza 0.11
Incremental Checkpointing : MVP for stateful apps
Some applications have ~ 2 TB state in production
Stateful apps don’t really work without incremental checkpointing
Samza Task
State
changelog chec
kpoi
nt
Host 1
Input stream(e.g. Kafka)
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library
Speed Thrills .. but can kill
● Local State Considerations: ○ State should NOT be reseeded under normal operations (e.g.
Upgrades, Application restarts)
○ Minimal State should be reseeded
- If a container dies/removed
- If a container is added
How Samza keeps Local state ‘stable’ ?
P0
P1 P2 P3
Task-0 Task-1 Task-2 Task-3
P0
P1
P2
P3
Host-E Host-B Host-C
Coordinator Stream: Task-Container-Host Mapping
Container-0 -> Host-E
Container-1 -> Host-BContainer-2 -> Host-C
AM JC
Yarn-RM
Ask: Host-E Allocate: Host-E
Samza Job
Input Stream
Change-log
Enable Continuous Scheduling
Stream A
Stream B
Stream C
Job 1
Job 2
● Kafka or durable intermediate queues are leveraged to avoid backpressure issues in a pipeline.
● Allows each stage to be independent of the next stage
Backpressure in a Pipeline
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library
Pluggable system consumers
Samza Processor
Mongo DB
DynamoDB Streams
Kafka
Databus/Brooklin
Kinesis
ZeroMQ
… Azure EventHub, Azure Document DB, Google Pub-Sub etc.
Oracle, Espresso
Dynamo-DB
Batch processing in Samza!! (NEW)
● HDFS system consumer for Samza
● Same Samza processor can be used for processing events from Kafka and HDFS with no code changes
● Scenarios :
○ Experimentation and Testing
○ Re-processing of large datasets
○ Some datasets are readily available on HDFS (company specific)
Samza - HDFS support
HDFS input
(Samza)Processor HDFS
output
(Samza) Processor HDFS
output
HDFS input(Samza)
Processor Kafka
Kafka
New
Available since Samza 0.10
The batch job auto-terminates when the input is fully processed.
Reprocessing entire Dataset onlineUpdates
(Samza) Processor
(Samza) Processor
Bootstrap
output
Kafka
Brooklin
Brooklin
Database(espresso)
set offset=0
Nearline Processing
Reprocessing ‘large’ Datasets “offline”Updates
(Samza) Processor
(Samza) Processor
Backup
output
Kafka
DatabusDatabase(espresso)
Database Backup (HDFS)
Nearline Processing
Offline Processing
Samza batch pipelines
(Samza) Processor
HDFS output
HDFS input
(Samza)Processor Kafka
(Samza) Processor
HDFS output
HDFS input
(Samza)Processor HDFS
Samza- HDFS Early Performance Results !!
Benchmark : Count number of records grouped by <Field>
DataSize (bytes): 250 GBNumber of files : 487
Samza
Map/Reduce
SparkNumber of Containers
Time-seconds
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources (batch and streaming)
● Stream processing as a service AND (coming soon) as an embedded library
Stream Processing as a Service
● Based on YARN○ Yarn-RM high availability ○ Work preserving RM ○ Support for Heterogenous hardware with Node Labels (NEW)
● Easy upgrade of Samza framework : Use the Samza version deployed on the machine instead of packaging it with the application.
● Disk Quotas for local state (e.g. rocksDB state)● Samza Management Service(SAMZA-REST)-> Next Slide
YARN Resource Managers
Nodes in the YARN cluster
RM SRR RM SRR RM SRR
NM SRN
Samza Management Service (Samza REST) (NEW)
NM SRN NM SRN NM SRN
NM SRN NM SRN NM SRN NM SRN
/v1/jobs /v1/jobs /v1/jobs
Samza Containers
1. Exposes /jobs resource to start, stop, get status of jobs etc.
2. Cleans up stores from dead jobs
SamzaREST
YARN processes(RM/NM)
Agenda
1. Stream processing2. State of the union3. Apache Samza : Key differentiators
4. Apache Samza Futures
Coming Soon : Samza as a Library
Stream Processor
Code Job Coordinator
Stream Processor
Code Job Coordinator
Stream Processor
Code Job Coordinator
...
Leader
● No YARN dependency● Will use ZK for leader
election
● Embed stream processing into your bigger application
StreamProcessor processor = new StreamProcessor (config, “job-name”, “job-id”);processor.start();processor.awaitStart();…processor.stop();
Coming Soon: High Level API and Event Time (SAMZA-914/915)
Count the number of PageViews by Region, every 30 minutes.@Override public void init(Collection<SystemMessageStream> sources) {
sources.forEach(source -> {
Function<PageView, String> keyExtractor = view -> view.getRegion();
source.map(msg -> new PageViewMessage(msg))
.window(Windows.<PageViewMessage, String>intoSessionCounter(keyExtractor,
WindowType.Tumbling, 30*60 ))
});
}
Coming Soon: First class support for Pipelines (Samza- 1041)
public class MyPipeline implements PipelineFactory {
public Pipeline create(Config config) {
Processor myShuffler = getShuffle(config); Processor myJoiner = getJoin(config);
Stream inStream = getStream(config, “inStream1”); // … omitted for brevity
PipelineBuilder builder = new PipelineBuilder(); return builder.addInputStreams(myShuffler, inStream1) .addOutputStreams(myShuffler, intermediateOutStream) .addInputStreams(myJoiner, intermediateOutStream, inStream2) .addOutputStreams(myJoiner, finalOutStream) .build(); }}
Shuffle
Join
input output
Future: Miscellaneous
● Exactly once processing● Making it easier to auto-scale even with Local State (on-
demand Standby containers)● Turnkey Disaster Recovery for stateful applications
○ Easy Restore of changelog and checkpoints from some other datacenter.
● Improved support for Batch jobs● SQL over Streams● A default Dashboard :)
Questions ?