Date posted: 08-Jan-2017
Category: Data & Analytics
Uploaded by: dataartisans
Kostas Tzoumas @kostas_tzoumas
Big Data Ldn, November 4, 2016
Stream Processing with Apache Flink®
Debunking Some Common Myths in Stream Processing
Original creators of Apache Flink®
Providers of the dA Platform, a supported
Flink distribution
Outline
What is data streaming
Myth 1: The throughput/latency tradeoff
Myth 2: Exactly once not possible
Myth 3: Streaming is for (near) real-time
Myth 4: Streaming is hard
The streaming architecture
Reconsideration of data architecture
Better app isolation
More real-time reaction to events
Robust continuous applications
Process both real-time and historical data
[Figure: three "app state" stores, an event log, and a query service]
What is (distributed) streaming
Computations on never-ending “streams” of data records (“events”)
Stream processor distributes the computation in a cluster
[Figure: four parallel instances of "Your code"]
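As a rough illustration of how a stream processor might spread one computation over a cluster, the sketch below routes each event to a worker chosen by a stable hash of its key, so all events for a key reach the same instance of "your code". The worker count, the event shape, and the function names are assumptions made for this example; this is not Flink's API.

```python
from collections import defaultdict
from zlib import crc32

NUM_WORKERS = 4  # assumption: four parallel instances of "your code"


def worker_for(key: str, num_workers: int = NUM_WORKERS) -> int:
    """Pick a worker with a stable hash (crc32 rather than Python's salted hash())."""
    return crc32(key.encode("utf-8")) % num_workers


def distribute(events, num_workers: int = NUM_WORKERS):
    """Route each (key, value) event to the worker that owns its key."""
    inboxes = defaultdict(list)
    for key, value in events:
        inboxes[worker_for(key, num_workers)].append((key, value))
    return inboxes


# Hypothetical events: (user id, click count)
events = [("user-a", 1), ("user-b", 1), ("user-a", 1), ("user-c", 1)]
inboxes = distribute(events)
```

Because the routing depends only on the key, each worker can keep the state for its keys locally.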
What is stateful streaming
Computation and state
• E.g., counters, windows of past events, state machines, trained ML models
Result depends on history of stream
Stateful stream processor gives the tools to manage state
• Recover, roll back, version, upgrade, etc.
[Figure: "Your code" with attached state]
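A minimal sketch of "result depends on history of stream": an operator that keeps a count and a total per key and emits the running average. The class name and the sensor readings are hypothetical, and a real stateful stream processor would also manage this state for you (recovery, versioning, upgrades), which the sketch leaves out.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps per-key state so each output reflects every earlier event for that key."""

    def __init__(self):
        # State: key -> [event count, sum of values]
        self.state = defaultdict(lambda: [0, 0.0])

    def process(self, key, value):
        entry = self.state[key]
        entry[0] += 1
        entry[1] += value
        return entry[1] / entry[0]


op = RunningAverage()
readings = [("s1", 10.0), ("s1", 20.0), ("s2", 5.0), ("s1", 30.0)]
outputs = [op.process(key, value) for key, value in readings]
```

The same input value can produce a different output depending on what arrived before it, which is why the state has to survive failures.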
What is event-time streaming
Data records associated with timestamps (time series data)
Processing depends on timestamps
Event-time stream processor gives you the tools to reason about time
• E.g., handle streams that are out of order
• Core feature is watermarks – a clock to measure event time
[Figure: "Your code" with state, receiving events in the order t3, t1, t2, t4 and producing results for t1-t2 and t3-t4]
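One way to picture a watermark is as "the largest timestamp seen so far, minus an allowed delay". The sketch below counts events per tumbling event-time window and emits a window only once the watermark has passed its end. The window size, the allowed out-of-orderness, and the policy of dropping late events are assumptions for this example, not a description of Flink's implementation.

```python
WINDOW_SIZE = 10       # assumption: tumbling windows of 10 time units
MAX_OUT_OF_ORDER = 5   # assumption: events arrive at most 5 time units late


def windowed_counts(timestamps):
    """Return (window start, window end, count) tuples in emission order."""
    open_windows = {}            # window start -> count of events so far
    max_seen = float("-inf")     # largest event time observed
    results = []
    for ts in timestamps:
        start = ts - ts % WINDOW_SIZE
        watermark = max_seen - MAX_OUT_OF_ORDER
        if start + WINDOW_SIZE <= watermark:
            continue  # the window was already emitted: drop the late event
        open_windows[start] = open_windows.get(start, 0) + 1
        max_seen = max(max_seen, ts)
        watermark = max_seen - MAX_OUT_OF_ORDER
        # Emit every window whose end the watermark has now passed.
        for w in sorted(w for w in open_windows if w + WINDOW_SIZE <= watermark):
            results.append((w, w + WINDOW_SIZE, open_windows.pop(w)))
    # End of (bounded) input: flush whatever is still open.
    for w in sorted(open_windows):
        results.append((w, w + WINDOW_SIZE, open_windows.pop(w)))
    return results


# Out-of-order event times; the final 2 arrives after its window has closed.
counts = windowed_counts([3, 1, 12, 8, 14, 27, 2])
```

The event at time 8 arrives after the event at time 12 but is still counted in the first window, because the watermark had not yet reached 10.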
What is streaming
Continuous processing on data that is continuously generated
I.e., pretty much all “big” data
It’s all about state and time
Debunking some common stream processing myths
Myth 1: Throughput/latency tradeoff
Myth 1: you need to choose between high throughput or low latency
Physical limits
• In reality, network determines both the achievable throughput and latency
• A well-engineered system achieves these limits
Flink performance
10s of millions of events per second on 10s of nodes, scaled to 1000s of nodes, with latency in single-digit milliseconds
Myth 2: Exactly once not possible
Exactly once: under failures, system computes result as if there was no failure
In contrast to:
• At most once: no guarantees
• At least once: duplicates possible
Exactly once state versus exactly once delivery
Myth 2: Exactly once state not possible/too costly
Transactions
“Exactly once” is transactions: either all actions succeed or none succeed
Transactions are possible
Transactions are useful
Let’s not start eventual consistency all over again…
Flink checkpoints
Periodic asynchronous consistent snapshots of application state
Provide exactly-once state guarantees under failures
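To make "computes result as if there was no failure" concrete, here is a deliberately simplified sketch: the job periodically snapshots its state together with the source offset, and after a crash it restores the snapshot and rewinds the source. The snapshot here is synchronous and single-threaded, which is an assumption made to keep the example short; it is not Flink's asynchronous checkpointing algorithm, and all names are hypothetical.

```python
import copy

CHECKPOINT_INTERVAL = 3  # assumption: snapshot after every 3 events


def run(events, fail_at=None):
    """Sum values per key; optionally crash once at index `fail_at` and recover."""
    state = {}                                # running sums per key
    checkpoint = {"offset": 0, "state": {}}   # last consistent snapshot
    offset = 0
    failed = False
    while offset < len(events):
        if offset == fail_at and not failed:
            failed = True
            # Crash: in-memory state is lost. Restore the snapshot and
            # rewind the source to the offset stored with it.
            state = copy.deepcopy(checkpoint["state"])
            offset = checkpoint["offset"]
            continue
        key, value = events[offset]
        state[key] = state.get(key, 0) + value
        offset += 1
        if offset % CHECKPOINT_INTERVAL == 0:
            checkpoint = {"offset": offset, "state": copy.deepcopy(state)}
    return state


events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
without_failure = run(events)
with_failure = run(events, fail_at=4)
```

The event at index 3 is processed twice in the failing run, yet it is counted once in the final state, because the state and the offset are restored together.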
End-to-end exactly once
Checkpoints double as transaction coordination mechanism
Source and sink operators can take part in checkpoints
Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates
[Figure: transactional sinks]
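The "effectively once" idea can be sketched by comparing two sinks under replay. After a failure the tail of the output is delivered again; a sink that increments a stored value double counts, while a sink that overwrites the value for a key (an idempotent update) ends in the same state as a run without failure. A plain dict stands in for the external store, and the tuple layout is an assumption for this example.

```python
def replayed(updates, replay_from):
    """Simulate at-least-once delivery: updates from `replay_from` on are sent twice."""
    return updates + updates[replay_from:]


def apply_increments(deliveries):
    """Non-idempotent sink: every delivery adds its delta to the stored value."""
    store = {}
    for key, running_total, delta in deliveries:
        store[key] = store.get(key, 0) + delta
    return store


def apply_upserts(deliveries):
    """Idempotent sink: every delivery overwrites the stored value for its key."""
    store = {}
    for key, running_total, delta in deliveries:
        store[key] = running_total
    return store


# Hypothetical updates: (key, running total computed by the job, delta)
updates = [("page-1", 1, 1), ("page-1", 2, 1), ("page-2", 1, 1), ("page-1", 3, 1)]
deliveries = replayed(updates, replay_from=2)

double_counted = apply_increments(deliveries)
effectively_once = apply_upserts(deliveries)
```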
State management
Checkpoints triple as state versioning mechanism (savepoints)
Go back and forth in time while maintaining state consistency
Ease code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests
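A small sketch of the savepoint idea, under the assumption that a savepoint is simply a stored pair of source offset and state: one version of the code runs up to a point, a savepoint is taken, and then both the old code and an upgraded version resume from the same consistent state. The function names and the "bot:" event convention are invented for the example.

```python
import copy


def run(events, keep, savepoint=None, until=None):
    """Count kept events per type, starting from a savepoint if one is given."""
    sp = savepoint or {"offset": 0, "state": {}}
    state = copy.deepcopy(sp["state"])  # never mutate the stored savepoint
    end = len(events) if until is None else until
    for event in events[sp["offset"]:end]:
        if keep(event):
            state[event] = state.get(event, 0) + 1
    return {"offset": end, "state": state}


def keep_all(event):
    """Version 1 of the application logic: count everything."""
    return True


def drop_bots(event):
    """Version 2 (the 'upgrade'): ignore events flagged as bot traffic."""
    return not event.startswith("bot:")


events = ["click", "view", "bot:click", "click", "bot:view", "view"]

sp = run(events, keep_all, until=3)              # run v1, then take a savepoint
result_v1 = run(events, keep_all, savepoint=sp)  # keep running the old code
result_v2 = run(events, drop_bots, savepoint=sp) # resume with the upgraded code
```

Running both versions from the same savepoint is the what-if or A/B pattern: the two results differ only in what happened after the savepoint.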
Myth 3: Streaming and real time
Myth 3: streaming and real-time are synonymous
Streaming is a new model
• Essentially, state and time
• Low latency/real time is the icing on the cake
Low latency and high latency streams
[Figure: a log of hourly partitions, timestamped from 2016-3-1 12:00 am through 2016-3-12 3:00 am…, consumed as Stream (low latency), Batch (bounded stream), or Stream (high latency)]
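The "batch is a bounded stream" view can be illustrated with one piece of logic applied to both kinds of input. In this sketch the same generator function consumes a finite list and an endless generator; only the source differs. The function names and the synthetic sensor values are assumptions for the example.

```python
import itertools


def running_max(stream):
    """Yield the maximum seen so far for each incoming value; works on any iterable."""
    best = None
    for value in stream:
        best = value if best is None or value > best else best
        yield best


def sensor():
    """Hypothetical unbounded source: never stops producing readings."""
    for i in itertools.count():
        yield (i * 7) % 10


# Bounded input ("batch"): the stream simply ends.
batch_result = list(running_max([3, 1, 4, 1, 5]))

# Unbounded input: same code, results are read incrementally.
first_five = list(itertools.islice(running_max(sensor()), 5))
```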
Robust continuous applications
Accurate computation
Batch processing is not an accurate computation model for continuous data
• Misses the right concepts and primitives
• Time handling, state across batch boundaries
Stateful stream processing a better model
• Real-time/low-latency is the icing on the cake
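A worked example of the "state across batch boundaries" problem, assuming a session is a run of events with gaps of at most 30 minutes. Counting sessions over the continuous stream gives one answer; running the same logic independently on hourly batches splits any session that spans an hour boundary and over-counts. The timestamps are made up for illustration.

```python
SESSION_GAP = 30 * 60  # assumption: 30 minutes of inactivity ends a session
HOUR = 60 * 60


def count_sessions(timestamps, gap=SESSION_GAP):
    """Count sessions in a sorted list of event times (in seconds)."""
    sessions, last = 0, None
    for ts in timestamps:
        if last is None or ts - last > gap:
            sessions += 1
        last = ts
    return sessions


def count_sessions_hourly_batches(timestamps):
    """Same logic run per hourly batch: state is lost at every batch boundary."""
    batches = {}
    for ts in timestamps:
        batches.setdefault(ts // HOUR, []).append(ts)
    return sum(count_sessions(batch) for batch in batches.values())


# One user active from 0:50 to 1:10, then again at 2:30.
clicks = [50 * 60, 55 * 60, 65 * 60, 70 * 60, 150 * 60]

streaming_answer = count_sessions(clicks)             # 2 sessions
batch_answer = count_sessions_hourly_batches(clicks)  # 3 sessions
```

The batch answer is wrong only because the session that started at 0:50 was cut at the 1:00 boundary, which is the kind of inaccuracy that tends to go unnoticed.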
Myth 4: How hard is streaming?
Myth 4: streaming is too hard to learn
You are already doing streaming, just in an ad hoc way
Most data is unbounded and the code changes slower than the data
• This is a streaming problem
It's about your data and code
What's the form of your data?
• Unbounded (e.g., clicks, sensors, logs), or
• Bounded (e.g., ???*)
What changes more often?
• My code changes faster than my data
• My data changes faster than my code
* Please help me find a great example of naturally bounded data
It's about your data and code
If your data changes faster than your code you have a streaming problem
• You may be solving it with hourly batch jobs depending on someone else to create the hourly batches
• You are probably living with inaccurate results without knowing it
It's about your data and code
If your code changes faster than your data you have an exploration problem
• Using notebooks or other tools for quick data exploration is a good idea
• Once your code stabilizes you will have a streaming problem, so you might as well think of it as such from the beginning
Flink in the real world
Flink community
> 240 contributors, 95 contributors in Flink 1.1
42 meetups around the world with > 15,000 members
2x-3x growth in 2015, similar in 2016
Powered by Flink
Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.
King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.
Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.
Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.
See more at flink.apache.org/poweredby.html
30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily
Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second
Flink Forward 2016
Current work in Flink
Ongoing Flink development
[Figure: roadmap diagram. Labels: Connectors; Session Windows; (Stream) SQL; Library enhancements; Metric System; Operations; Ecosystem; Application Features; Metrics & Visualization; Dynamic Scaling; Savepoint compatibility; Checkpoints to savepoints; More connectors; Stream SQL; Windows; Large state Maintenance; Fine grained recovery; Side in-/outputs; Window DSL; Broader Audience; Security; Mesos & others; Dynamic Resource Management; Authentication; Queryable State]
A longer-term vision for Flink
Streaming use cases
Application | Technology
(Near) real-time apps | Low-latency streaming
Continuous apps | High-latency streaming
Analytics on historical data | Batch as special case of streaming
Request/response apps | Large queryable state
Request/response applications
Queryable state: query Flink state directly instead of pushing results in a database
Large state support and query API coming in Flink
[Figure: queries]
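The queryable-state idea can be sketched as an operator that answers reads against its own state rather than exporting every result to a database first. The class and method names are hypothetical and this is not Flink's query API, which the slide describes as still upcoming; a lock guards the state on the assumption that queries arrive on a different thread than events.

```python
import threading


class QueryableCounter:
    """Counts events per key and serves point lookups directly from its state."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def process(self, key):
        """Called by the streaming job for every incoming event."""
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1

    def query(self, key):
        """Called by a request/response app; unknown keys count as zero."""
        with self._lock:
            return self._counts.get(key, 0)


op = QueryableCounter()
for key in ["a", "a", "b"]:
    op.process(key)

answer = op.query("a")
```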
In summary
The need for streaming comes from a rethinking of data infra architecture
• Stream processing then just becomes natural
Debunking 4 common myths
• Myth 1: The throughput/latency tradeoff
• Myth 2: Exactly once not possible
• Myth 3: Streaming is for (near) real-time
• Myth 4: Streaming is hard
Thank you!
@kostas_tzoumas @ApacheFlink @dataArtisans
We are hiring!
data-artisans.com/careers