Date posted: 22-Jan-2018
Category: Technology
Uploaded by: scylladb
Views: 286 | Downloads: 3
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
Burak Yavuz, Software Engineer, Databricks
Burak Yavuz
● Software Engineer, Databricks: "We make your streams come true"
● Apache Spark Committer as of Feb 2017
● MS in Management Science & Engineering, Stanford University
● BS in Mechanical Engineering, Bogazici University, Istanbul
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos
The simplest way to perform streaming analytics is not having to reason about streaming at all
New Model
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input; usual map/filter/reduce, plus new window and session ops
[Figure: with a trigger of every 1 sec, the query runs over the input table at times 1, 2, 3, seeing data up to 1, data up to 2, data up to 3]
New Model
Result: final operated table updated every trigger interval
Output: what part of result to write to data sink after every trigger
Complete output: write full result table every time
[Figure: with a trigger of every 1 sec, the query turns data up to 1, 2, 3 into the result for data up to 1, 2, 3; in complete mode, all rows of the result table are output each time]
New Model
Result: final operated table updated every trigger interval
Output: what part of result to write to data sink after every trigger
Complete output: write full result table every time
Append output: write only new rows that got added to the result table since the previous batch
*Not all output modes are feasible with all queries
[Figure: same trigger/result timeline as before; in append mode, only rows new since the last trigger are output]
Output Modes
▪ Append mode (default): new rows added to the Result Table since the last trigger are output to the sink. Rows are output only once and cannot be rescinded.
Example use cases: ETL
Output Modes
▪ Complete mode: the whole Result Table is output to the sink after every trigger. Supported for aggregation queries.
Example use cases: Monitoring
Output Modes
▪ Update mode (available since Spark 2.1.1): only the rows in the Result Table that were updated since the last trigger are output to the sink.
Example use cases: Alerting, Sessionization
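The three modes can be contrasted with a small sketch outside of Spark. This is plain Scala over hypothetical result tables, not Spark API: given the result table before and after a trigger, complete emits everything, append emits only rows that are new, and update emits rows that are new or changed.

```scala
// Sketch of output-mode semantics (plain Scala, not Spark API).
// prev / now: the result table before and after a trigger,
// keyed by window, with a count per window.
object OutputModes {
  type Result = Map[String, Int]

  // Complete: the whole result table, every trigger.
  def complete(prev: Result, now: Result): Result = now

  // Append: only rows that did not exist before (rows are never rescinded).
  def append(prev: Result, now: Result): Result =
    now.filter { case (k, _) => !prev.contains(k) }

  // Update: rows that are new or whose value changed since the last trigger.
  def update(prev: Result, now: Result): Result =
    now.filter { case (k, v) => prev.get(k) != Some(v) }
}
```

For prev = Map("12:00" -> 3) and now = Map("12:00" -> 5, "13:00" -> 1), append emits only the 13:00 row, while update emits both and complete emits the full table.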
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos
Event-time Aggregations
Many use cases require aggregate statistics by event time, e.g. what's the number of errors in each system in 1-hour windows?
Many challenges: extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event-time operations
Event-time Aggregations
Windowing is just another type of grouping in Structured Streaming.

Number of records every hour:

    parsedData
      .groupBy(window("timestamp", "1 hour"))
      .count()

Average signal strength of each device every 10 mins:

    parsedData
      .groupBy($"device", window("timestamp", "10 mins"))
      .avg("signal")

Use built-in functions to extract event time; no need for separate extractors.
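Logically, window() just derives a grouping key from each timestamp. A minimal sketch of that bucketing in plain Scala (windowStart and countsPerWindow are illustrative names, not Spark API; times are minutes for simplicity):

```scala
// Sketch: windowing as grouping (plain Scala, not Spark).
// Each event time (in minutes) is mapped to the start of its window,
// then counts are grouped by that derived key -- logically what
// groupBy(window("timestamp", "10 mins")).count() does.
object Windowing {
  // Start of the window containing this event time.
  def windowStart(eventMin: Int, windowMin: Int): Int =
    (eventMin / windowMin) * windowMin

  // Count of events per window, keyed by window start.
  def countsPerWindow(eventMins: Seq[Int], windowMin: Int): Map[Int, Int] =
    eventMins.groupBy(windowStart(_, windowMin)).map { case (w, es) => (w, es.size) }
}
```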
Advanced Aggregations
Powerful built-in aggregations
Multiple simultaneous aggregations
Custom aggs using reduceGroups, UDAFs

    parsedData
      .groupBy(window("timestamp", "1 hour"))
      .agg(avg("signal"), stddev("signal"), max("signal"))

variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...

    // Compute a histogram of signal strength per device type.
    val hist = ds.groupByKey(_.deviceType).mapGroups {
      case (deviceType, data: Iterator[DeviceData]) =>
        val buckets = new Array[Int](10)
        data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
        (deviceType, buckets)
    }
Stateful Processing for Aggregations
In-memory streaming state is maintained for aggregations.
[Figure: hourly window counts (12:00-13:00, 13:00-14:00, ..., 16:00-17:00) evolving across triggers; red entries mark state updated with late data]
Keeping state allows late data to update counts of old windows.
But the size of the state increases indefinitely if old windows are not dropped.
Watermarking and Late Data
Watermark [Spark 2.1]: a moving threshold that trails behind the max seen event time
The trailing gap defines how late data is expected to be
[Figure: max event time at 12:30 PM, watermark at 12:20 PM with a trailing gap of 10 mins; data older than the watermark is not expected]
Watermarking and Late Data
Data newer than the watermark may be late, but is allowed to aggregate
Data older than the watermark is "too late" and dropped
State older than the watermark is automatically deleted to limit the amount of intermediate state
[Figure: event-time axis with the watermark trailing the max event time; late data above the watermark is allowed to aggregate, data below it is too late and dropped]
Watermarking and Late Data
Control the tradeoff between state size and lateness requirements:
Handle more lateness → keep more state
Reduce state → handle less lateness

    parsedData
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window("timestamp", "5 minutes"))
      .count()

[Figure: an allowed lateness of 10 mins separates the max event time from the watermark; late data above the watermark is allowed to aggregate, older data is dropped]
Watermarking to Limit State [Spark 2.1]

    parsedData
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window("timestamp", "5 minutes"))
      .count()

[Figure: processing time (12:00, 12:05, 12:10, 12:15) vs event time (12:10, 12:15, 12:20). The system tracks the max observed event time (12:14); the watermark is updated to 12:14 - 10 min = 12:04 for the next trigger, and state for windows before 12:04 is deleted. A record stamped 12:08 is late but still considered in the counts; a record stamped 12:04 arriving after that is too late, ignored in the counts, and its state is dropped]

More details in blog post!
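The bookkeeping in that figure can be sketched in plain Scala (illustrative names, event times as minutes since midnight, not Spark internals):

```scala
// Sketch of watermark bookkeeping (plain Scala, not Spark internals).
// The watermark trails the max seen event time by the configured delay;
// older data is dropped, and window state ending before the watermark
// is deleted.
object Watermark {
  // New watermark after a trigger: max observed event time minus the delay.
  def advance(maxEventTime: Int, delayMin: Int): Int = maxEventTime - delayMin

  // A record is accepted only if it is not older than the watermark.
  def accepts(watermark: Int, eventTime: Int): Boolean = eventTime >= watermark

  // Keep only window state whose end is at or after the watermark.
  def pruneState(watermark: Int, windows: Map[(Int, Int), Long]): Map[(Int, Int), Long] =
    windows.filter { case ((_, end), _) => end >= watermark }
}
```

With a 10-minute delay and a max event time of 12:14 (734 minutes), the watermark advances to 12:04 (724); a record stamped 12:08 is still counted while one stamped 12:03 is dropped, matching the slide above.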
Working With Time

    df.withWatermark("timestampColumn", "5 hours")
      .groupBy(window("timestampColumn", "1 minute"))
      .count()
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))

Separate processing details (output rate, late data tolerance) from query semantics.
Working With Time

    df.withWatermark("timestampColumn", "5 hours")
      .groupBy(window("timestampColumn", "1 minute"))   // how to group data by time
      .count()
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))

The grouping is the same in streaming & batch.
Working With Time

    df.withWatermark("timestampColumn", "5 hours")      // how late data can be
      .groupBy(window("timestampColumn", "1 minute"))
      .count()
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))
Working With Time

    df.withWatermark("timestampColumn", "5 hours")
      .groupBy(window("timestampColumn", "1 minute"))
      .count()
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))    // how often to emit updates
Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows any user-defined stateful ops to a user-defined state
Direct support for per-key timeouts in event time or processing time
Supports Scala and Java

    ds.groupByKey(groupingFunc)
      .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

    def mappingWithStateFunc(key: K, values: Iterator[V], state: GroupState[S]): U = {
      // update or remove state
      // set timeouts
      // return mapped value
    }
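The shape of mappingWithStateFunc can be sketched without Spark. Below, a hypothetical running count per key, with Option[RunningCount] standing in for Spark's GroupState handle (RunningCount and countEvents are illustrative names, not Spark API):

```scala
// Sketch of a mapGroupsWithState-style update function (plain Scala).
// `state` stands in for Spark's GroupState[S]: None means no state yet.
// Returns the mapped output row and the new state to keep.
case class RunningCount(count: Long)

object StatefulCount {
  def countEvents(key: String,
                  values: Iterator[String],
                  state: Option[RunningCount]): ((String, Long), Option[RunningCount]) = {
    val old     = state.map(_.count).getOrElse(0L)   // state.getOption in Spark
    val updated = RunningCount(old + values.size)    // state.update(...) in Spark
    ((key, updated.count), Some(updated))            // mapped value returned to the query
  }
}
```

Calling it across two triggers carries the count forward: two events then one event for the same key yields outputs of 2 and then 3.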
flatMapGroupsWithState
▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
▪ Invoked once per group in batch
▪ Invoked every trigger, for each group that has data, in streaming
▪ Requires the user to provide an output mode for the function
flatMapGroupsWithState
▪ mapGroupsWithState is a special case with:
  o Output mode: Update
  o Output size: 1 row per group
▪ Supports both Processing Time and Event Time timeouts
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos
Alerting
    val monitoring = stream
      .as[Event]
      .groupByKey(_.id)
      .flatMapGroupsWithState(Append, GST.ProcessingTimeTimeout) {
        (id: Int, events: Iterator[Event], state: GroupState[…]) =>
          ...
      }
      .writeStream
      .queryName("alerts")
      .foreach(new PagerdutySink(credentials))
Monitor a stream using custom stateful logic with timeouts.
Alerting
▪ Save your state to Scylla to power dashboards
▪ Have the stream trigger alerts ASAP
Sessionization
    val monitoring = stream
      .as[Event]
      .groupByKey(_.session_id)
      .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
        (id: Int, events: Iterator[Event], state: GroupState[…]) =>
          ...
      }
      .writeStream
      .scylla("trips")
Analyze sessions of user/system behavior
Sessionization
▪ Update sessions in your stream
▪ Save it to a NoSQL store like Scylla!
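The core session logic can be sketched in plain Scala (illustrative, not Spark API): events for one key, sorted by event time, split into sessions whenever the gap between consecutive events exceeds the timeout, which is the role the event-time timeout plays in the query above.

```scala
// Sketch of session splitting (plain Scala, not Spark).
// Consecutive events closer than the timeout belong to one session;
// a larger gap closes the session and starts a new one.
object Sessionize {
  // Returns (sessionStart, sessionEnd) pairs for the given event times (minutes).
  def sessions(times: Seq[Int], timeoutMin: Int): Vector[(Int, Int)] =
    times.sorted.foldLeft(Vector.empty[(Int, Int)]) { (acc, t) =>
      acc.lastOption match {
        case Some((start, end)) if t - end <= timeoutMin =>
          acc.init :+ ((start, t))   // within the timeout: extend the session
        case _ =>
          acc :+ ((t, t))            // gap too large (or first event): new session
      }
    }
}
```

Events at minutes 1, 3, 20, 22 with a 5-minute timeout split into two sessions: (1, 3) and (20, 22).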
Demo
Try Spark 2.2 on Community Edition today!
https://databricks.com/try-databricks
Apache Spark’s Structured Streaming at Scale Series
https://databricks.com/blog/category/engineering
Twitter: @databricks
We are hiring!
https://databricks.com/company/careers
THANK YOU
“Does anyone have any questions for my answers?” - Henry Kissinger