SPARK SUMMIT EUROPE 2016
Distributed Time Series Analysis Framework For Spark
Larisa Sawyer, Two Sigma
November 1, 2016
Time series examples
• Stock market prices
• Temperatures
• Height
• …
[Figure: S&P 500, 1/3/1950 to 1/3/2016; y-axis $0.0 to $2,500.0]
[Figure: temperature, y-axis 18°C to 34°C; series: New York, Brussels]
[Figure: height (100cm to 180cm) by age (5 to 15 years); series: Avg US female, Larisa]
What do we do with time series data?
• Forecast future values given past observations
Univariate time series
[Figure: corn price by day, 10/1 to 10/10; price labels $8.90, $8.95, $8.90, $9.06, $9.10; later values marked "?"]
Multivariate time series
• We can forecast better by joining multiple time series
• Our framework enables fast distributed temporal joins of large-scale unaligned time series
• Temporal join is a fundamental operation for time series analysis
[Figure: corn price ($8.90, $8.95, $8.90, $9.06, $9.10) and temperature (75°F, 72°F, 71°F, 72°F, 68°F, 67°F, 65°F) by day, 10/1 to 10/10]
Multivariate time series
[Figure: the same chart in euros and degrees Celsius: corn price (€7.94, €7.98, €7.94, €8.08, €8.12) and temperature (23°C, 22°C, 21°C, 22°C, 20°C, 19°C, 18°C) by day, 10/1 to 10/10]
What is a left join?
[Figure: time series 1 and time series 2]
What is temporal join?
• A particular join function defined by a matching criterion over time
• Examples of criteria
  • look-backward
  • look-forward
[Figure: look-forward, time series 1 and time series 2]
[Figure: look-backward, time series 1 and time series 2; label: observation]
Temporal join with look-backward criteria
time      tweets
08:00 AM
10:00 AM
12:00 PM
time      BRK.A
08:00 AM
11:00 AM
Important Legal Information
The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination.
Temporal join with look-backward criteria
Joined table, one row per tweet time:
time      tweets    BRK.A
08:00 AM
10:00 AM
12:00 PM
Temporal joins in practice
[Figure: the tweets (08:00 AM, 10:00 AM, 12:00 PM) and BRK.A (08:00 AM, 11:00 AM) time series]
Time Series Scale
[Figure: the tweets (08:00 AM, 10:00 AM, 12:00 PM) and BRK.A (08:00 AM, 11:00 AM) time series]
We need fast and scalable distributed temporal join
Existing solutions
• Existing packages don’t support temporal join or can’t handle large time series
  • Pandas / R / Matlab
    • Limited to a single machine
  • Spark
    • Does scale, but all data is unordered
  • spark-ts
    • Expects a univariate time series to fit on a single machine
    • Splits by column
    • Supports only snapshot data
Flint: A new time series library for Spark
• Goal
  • Provide a collection of functions to manipulate and analyze time series at scale
  • Group, temporal join, summarize, aggregate …
• How
  • Build a time-series-aware data structure
    • TimeSeriesRDD extends RDD
  • Optimize using temporal locality
    • Reduce shuffling
    • Reduce memory pressure by streaming
What is a TimeSeriesRDD?
• TimeSeriesRDD vs RDD
  • Associate a time range with each partition
  • Track partition time ranges
  • Preserve temporal order
RDD
Raw Data:
time      temperature
6:00 AM   60°F
6:01 AM   61°F
…         …
7:00 AM   70°F
7:01 AM   71°F
…         …
8:00 AM   80°F
8:01 AM   81°F
…         …
RDD:
(6:00 AM, 60°F) (6:01 AM, 61°F) …
(7:00 AM, 70°F) (7:01 AM, 71°F) …
(8:00 AM, 80°F) (8:01 AM, 81°F) …
(6:58 AM, 64°F) (6:59 AM, 65°F) …
(7:34 AM, 74°F) (7:35 AM, 74°F) …
(7:58 AM, 76°F) (7:59 AM, 77°F) …
TimeSeriesRDD
TSRDD partitions, each with its time range:
[6:00 AM, 7:00 AM): (6:00 AM, 60°F) (6:01 AM, 61°F) …
[7:00 AM, 8:00 AM): (7:00 AM, 70°F) (7:01 AM, 71°F) …
[8:00 AM, ∞): (8:00 AM, 80°F) (8:01 AM, 81°F) …
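The idea of attaching a time range to each partition and keeping rows in temporal order can be sketched in plain Python. This is only an illustration under stated assumptions, not Flint's implementation: `Partition`, `build_partitions`, and the minutes-since-midnight encoding are hypothetical names and choices made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Row = Tuple[int, float]  # (minutes since midnight, temperature in °F)


@dataclass
class Partition:
    begin: float                      # inclusive lower bound of the time range
    end: float                        # exclusive upper bound (math.inf for the last one)
    rows: List[Row] = field(default_factory=list)


def build_partitions(rows: List[Row], boundaries: List[int]) -> List[Partition]:
    """Sort rows by time and place each one in the partition whose
    half-open range [begin, end) contains its timestamp."""
    ends = boundaries[1:] + [math.inf]
    parts = [Partition(b, e) for b, e in zip(boundaries, ends)]
    i = 0
    for row in sorted(rows, key=lambda r: r[0]):
        if row[0] < boundaries[0]:
            raise ValueError("row is earlier than the first partition")
        while row[0] >= parts[i].end:   # advance to the partition covering this time
            i += 1
        parts[i].rows.append(row)
    return parts


# Rows arrive in arbitrary order, as in a plain RDD.
raw = [(480, 80.0), (360, 60.0), (420, 70.0), (361, 61.0), (421, 71.0)]
parts = build_partitions(raw, boundaries=[360, 420, 480])
for p in parts:
    print(p.begin, p.end, p.rows)
```

Because every partition knows its own range, a later operation can tell which partitions hold the times it needs without scanning the rest.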
Group function
• A group function groups rows with exactly the same timestamps
time      city       temperature   group
1:00 PM   New York   70°F          group 1
1:00 PM   Brussels   60°F          group 1
2:00 PM   New York   71°F          group 2
2:00 PM   Brussels   61°F          group 2
3:00 PM   New York   72°F          group 3
3:00 PM   Brussels   62°F          group 3
4:00 PM   New York   73°F          group 4
4:00 PM   Brussels   63°F          group 4
Group function
• A group function groups rows with nearby timestamps
time      city       temperature
1:00 PM   New York   70°F
1:00 PM   Brussels   60°F
2:00 PM   New York   71°F
2:00 PM   Brussels   61°F
3:00 PM   New York   72°F
3:00 PM   Brussels   62°F
4:00 PM   New York   73°F
4:00 PM   Brussels   63°F
[Figure annotation: the rows are bracketed into group 1 and group 2]
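Both flavors of grouping can be sketched over rows that are already in temporal order. The functions below are hypothetical illustrations, not Flint's API. In particular, treating "nearby" as fixed-width buckets counted from an `origin` time is an assumption made for this example, and hours are encoded as integers (13 for 1:00 PM).

```python
from itertools import groupby


def group_exact(rows):
    """Lazily yield (time, rows) for runs of rows sharing the same timestamp.
    Relies on rows being sorted by time, so no shuffle or sort is needed."""
    for t, grp in groupby(rows, key=lambda r: r[0]):
        yield t, list(grp)


def group_nearby(rows, width, origin):
    """Lazily yield (bucket_start, rows) for rows whose timestamps fall in the
    same bucket [origin + k*width, origin + (k+1)*width)."""
    for bucket, grp in groupby(rows, key=lambda r: (r[0] - origin) // width):
        yield origin + bucket * width, list(grp)


rows = [
    (13, "New York", 70), (13, "Brussels", 60),
    (14, "New York", 71), (14, "Brussels", 61),
    (15, "New York", 72), (15, "Brussels", 62),
    (16, "New York", 73), (16, "Brussels", 63),
]
exact = list(group_exact(rows))
nearby = list(group_nearby(rows, width=2, origin=13))
print(len(exact), len(nearby))
```

Since both functions are generators over sorted input, only one group has to be held in memory at a time.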
Group in Spark
• Groups rows with exactly the same timestamps
[Figure: RDD holding rows 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM]
Group in Spark
• Data is shuffled and materialized on the workers
• Temporal order is not preserved after groupBy
• sortBy brings the data back to temporal order
[Figure: RDD → groupBy → RDD → sortBy → RDD, over rows 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM]
Group in TimeSeriesRDD
• Data is grouped per partition locally as streams
[Figure: TimeSeriesRDD rows from 1:00PM to 4:00PM grouped by timestamp within their partitions]
Benchmark for group + count
• Running time of count after group
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
[Figure: running time (0s to 100s) for input sizes 20M, 40M, 60M, 80M, and 100M; series: TimeSeriesRDD, DataFrame, RDD; annotations: 50 - 100X, 5 - 10X]
Temporal join
• A temporal join function is defined by a matching criterion over time
• A typical matching criterion has two parameters
  • direction – look-backward or look-forward
  • window – how far to look backward or forward
[Figure: look-backward temporal join, with the window marked]
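The two parameters, direction and window, can be made concrete with a small matching function. This is a sketch under assumptions rather than Flint's API: `temporal_match` is a hypothetical name, only the single closest observation is returned, and the window bound is treated as inclusive.

```python
from bisect import bisect_left, bisect_right


def temporal_match(right_times, t, direction, window):
    """Return the index in the sorted list right_times of the observation
    matched to time t, or None if nothing lies within the window."""
    if direction == "backward":
        i = bisect_right(right_times, t) - 1      # latest observation at or before t
        if i >= 0 and t - right_times[i] <= window:
            return i
    elif direction == "forward":
        i = bisect_left(right_times, t)           # earliest observation at or after t
        if i < len(right_times) and right_times[i] - t <= window:
            return i
    else:
        raise ValueError("direction must be 'backward' or 'forward'")
    return None


right_times = [1, 3, 5]  # hours
print(temporal_match(right_times, 2, "backward", window=1))
print(temporal_match(right_times, 2, "forward", window=1))
```

A lookup per row like this explains the semantics; it is not how a large join would be executed.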
Temporal join
• Temporal join with look-backward criteria and a window of 1 hour
[Figure: time series 1 (1:00AM, 2:00AM, 4:00AM, 5:00AM) and time series 2 (1:00AM, 3:00AM, 5:00AM)]
Temporal join
• Temporal join with look-backward criteria and a window of 1 hour
• How do we do temporal join in TimeSeriesRDD?
[Figure: two TimeSeriesRDDs, one with rows 1:00AM, 2:00AM, 4:00AM, 5:00AM and one with rows 1:00AM, 3:00AM, 5:00AM]
Temporal join in TimeSeriesRDD
• Temporal join with look-backward criteria and a window of 1 hour
• Partition time space into disjoint intervals
[Figure: the two input TimeSeriesRDDs (rows 1:00AM, 2:00AM, 4:00AM, 5:00AM and rows 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD]
Temporal join in TimeSeriesRDD
• Temporal join with look-backward criteria and a window of 1 hour
• Build dependency graph for the joined TimeSeriesRDD
[Figure: the input and joined TimeSeriesRDDs with partition ranges [1:00 AM, 4:00 AM) and [4:00 AM, 6:00 AM)]
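One way to picture the dependency graph is as a map from each left partition to the right partitions that could hold a match for any of its rows. The sketch below is an assumption about how such a graph could be derived from partition ranges for a look-backward join; `dependencies` is a hypothetical helper, not a Flint function.

```python
def dependencies(left_ranges, right_ranges, window):
    """For each left partition [begin, end), list the right partitions whose
    range overlaps [begin - window, end): the only times a look-backward
    match for a row in that left partition can come from."""
    deps = {}
    for i, (l_begin, l_end) in enumerate(left_ranges):
        earliest = l_begin - window
        deps[i] = [
            j for j, (r_begin, r_end) in enumerate(right_ranges)
            if r_begin < l_end and r_end > earliest
        ]
    return deps


# Partition ranges from the slide, in hours: [1:00 AM, 4:00 AM) and [4:00 AM, 6:00 AM).
ranges = [(1, 4), (4, 6)]
print(dependencies(ranges, ranges, window=1))
```

With a 1-hour window, the second left partition also depends on the first right partition, because a row at 4:00 AM may match an observation at 3:00 AM.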
Temporal join in TimeSeriesRDD
• Temporal join with look-backward criteria and a window of 1 hour
• Join data as streams per partition
[Figure, built up over several slides: rows 1:00AM, 2:00AM, 4:00AM, 5:00AM of one TimeSeriesRDD are streamed against rows 1:00AM, 3:00AM, 5:00AM of the other; final joined rows: 1:00AM with 1:00AM, 2:00AM with 1:00AM, 4:00AM with 3:00AM, 5:00AM with 5:00AM]
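Joining as streams can be sketched as a single forward pass over two time-ordered inputs. This is an illustrative merge under assumptions, not Flint's code: `lookback_join` is a hypothetical name, unmatched left rows are kept with `None` (left-join behavior), and the 1-hour window is inclusive, which agrees with the slide pairing 2:00AM with 1:00AM.

```python
def lookback_join(left, right, window):
    """Yield (time, left_value, right_value) for each left row, pairing it with
    the latest right row at or before its time, if within the window.
    Both inputs must be iterables of (time, value) sorted by time."""
    right_iter = iter(right)
    candidate = None                    # latest right row with time <= current left time
    pending = next(right_iter, None)    # next right row not yet passed
    for t, left_value in left:
        while pending is not None and pending[0] <= t:
            candidate = pending
            pending = next(right_iter, None)
        if candidate is not None and t - candidate[0] <= window:
            yield t, left_value, candidate[1]
        else:
            yield t, left_value, None


# Times from the slide, in hours; the values are made-up labels.
left = [(1, "a"), (2, "b"), (4, "c"), (5, "d")]
right = [(1, "x"), (3, "y"), (5, "z")]
joined = list(lookback_join(left, right, window=1))
print(joined)
```

Each input is read once and only one candidate row is kept in memory, which is the property that makes streaming attractive for large partitions.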
Benchmark for temporal join + count
• Running time of count after temporal join
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
[Figure: running time (0s to 100s) for input sizes 20M, 40M, 60M, 80M, and 100M; series: TimeSeriesRDD, DataFrame, RDD; annotations: 20 - 50X, 5 - 10X]
Functions over TimeSeriesRDD
• Grouping functions
• Temporal joins such as look-forward, look-backward, etc.
• Summarizers such as average, variance, z-score, etc. over grouping functions
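A summarizer can be thought of as a small accumulator that is fed the rows of a group one at a time. The sketch below uses Welford's online algorithm for the mean and sample variance. The `Summary` class and its method names are hypothetical, and defining the z-score relative to the accumulated mean and sample standard deviation is an assumption, not a statement about Flint's summarizers.

```python
import math


class Summary:
    """Streaming mean / variance accumulator (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; undefined (NaN) for fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    def zscore(self, x):
        """Distance of x from the mean in sample standard deviations."""
        return (x - self.mean) / math.sqrt(self.variance())


summary = Summary()
for temperature in [70.0, 60.0]:   # the 1:00 PM group from the grouping example
    summary.add(temperature)
print(summary.mean, summary.variance())
```

Because the accumulator needs a constant amount of state, it fits the streaming, per-partition style of processing described earlier.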
Open Source
• True!
• https://github.com/twosigma/flint
What’s next?
• TimeSeriesDataframe / TimeSeriesDataset
  • Speed up
  • Richer APIs
• Python bindings
• Additional summarizers
Key contributors
• Christopher Aycock
• Yuri Bogomolov
• Jonathan Coveney
• Li Jin
• David Medina
• Julia Meinwald
• David Palaitis
• Larisa Sawyer
• Leif Walsh
• Wenbo Zhao
Flint: Time Series For Spark
A library for general time series analysis operations at massive scale
Anne Hathaway has nothing to do with Berkshire Hathaway
Check it out in open source, and contribute
https://github.com/twosigma/flint