
Date post: 31-Dec-2020
Page 1: WANalytics: Analytics for a geo-distributed data ... (cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdf)

WANalytics: Analytics for a geo-distributed data-intensive world

Ashish Vulimiri*, Carlo Curino+, Brighten Godfrey*, Konstantinos Karanasos+, George Varghese+

* UIUC   + Microsoft

Page 2

Large organizations today: Massive data volumes

•  Data collected across several data centers for low end-user latency

•  Use cases:
–  User activity logs
–  Telemetry
–  …

[Diagram: data centers DC1, DC2, DC3]

Page 3

Current scales: 10s-100s TB/day

Microsoft   n * 10s TB/day
Twitter     100 TB/day
Facebook    15 TB/day
Yahoo       10 TB/day
LinkedIn    10 TB/day

across up to 10s of data centers

Page 4

Data must be analyzed as a whole

•  Need to analyze all this data to extract insight

•  Production workloads today:
–  Mix of SQL, MapReduce, machine learning, …

[Diagram: an analytics workload made up of SQL, MapReduce (MR), and machine learning (ML, k-means) tasks]

Page 5

Analytics on geo-distributed data: Centralized approach inadequate

Current solution: copy all data to central DC, run analytics there

1.  Consumes a lot of bandwidth
–  Cross-DC bandwidth is expensive, very scarce
–  “Total Internet capacity” only ≈ 100 Tbps

2.  Incompatible with sovereignty
–  Many countries are considering making it illegal to copy citizens’ data outside the country
–  Speculation: derived info will still be OK

Page 6

Alternative: Geo-distributed analytics

We build a system supporting geo-distributed analytics execution
-  Leave data partitioned across DCs
-  Push compute down (distribute workflow execution)


Page 9

Geo-distributed analytics

[Diagram: click_log and adserve_log are stored at each data center DC1 … DCn. The workflow combines a MapReduce preprocess step, a SQL join (⋈), and k-means clustering in Mahout.]

t = 0  push down preprocess
t = 1  distributed semi-join
t = 2  centralized k-means

Distributed execution:  0.03 TB/day
Centralized execution:  10 TB/day

333x cost reduction
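The 333x figure follows from the two daily transfer volumes quoted on this slide. A minimal sketch of the arithmetic; the function name `reduction_factor` is ours, chosen only for illustration:

```python
def reduction_factor(centralized_tb_per_day: float, distributed_tb_per_day: float) -> float:
    """Ratio of cross-DC data moved by the centralized plan to the distributed plan."""
    return centralized_tb_per_day / distributed_tb_per_day


# Volumes quoted on the slide: 10 TB/day centralized, 0.03 TB/day distributed.
factor = reduction_factor(10.0, 0.03)
print(f"{factor:.0f}x")  # prints "333x"
```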

Page 10

Building a system for Geo-distributed analytics

•  Possible challenges to address:
–  Bandwidth
–  Fault tolerance
–  Sovereignty
–  Latency
–  Consistency

•  Starting point: the system we build targets the batch applications considered earlier

Page 11

PROBLEM DEFINITION

Page 12

Computational model

•  DAGs of arbitrary tasks over geo-distributed data
•  Tasks can be white box or black box

[Diagram: click_log and adserve_log are stored at each data center DC1 … DCn. The DAG combines a MapReduce preprocess step, a SQL join (⋈), and a correlation analysis step written as user-provided code.]
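One way to picture this computational model is as a small task graph. The sketch below is illustrative only: the `Task` fields, the wiring of the three tasks, and the choice of which tasks count as white box are assumptions, not details taken from the slides.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Task:
    name: str
    engine: str                    # e.g. "MapReduce", "SQL", "user code"
    white_box: bool                # True if the system can inspect the task's semantics
    inputs: tuple[str, ...] = ()   # names of upstream tasks or base datasets


def execution_order(tasks: list[Task]) -> list[str]:
    """Return task names so that every task appears after the tasks it reads from."""
    names = {t.name for t in tasks}
    # Base datasets (e.g. click_log) are not tasks, so they are left out of the graph.
    graph = {t.name: {i for i in t.inputs if i in names} for t in tasks}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical wiring of the example workflow.
workflow = [
    Task("preprocess", "MapReduce", white_box=False, inputs=("click_log",)),
    Task("join", "SQL", white_box=True, inputs=("preprocess", "adserve_log")),
    Task("correlation", "user code", white_box=False, inputs=("join",)),
]
print(execution_order(workflow))  # ['preprocess', 'join', 'correlation']
```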

Page 13

Unique characteristics (what makes this problem novel)

1.  Arbitrary DAG of computational tasks

2.  No control over data partitioning
–  Partitioning dictated by external factors, e.g. end-user latency

3.  Cross-DC bandwidth is the only scarce resource
–  CPU, storage within DCs is relatively cheap

4.  Unusual constraints:
–  heterogeneous bandwidth cost/availability
–  sovereignty

5.  Bulk of load is stable, recurring workload
–  Consistent with production logs

Page 14

Problem statement

•  Support arbitrary DAG workflows on geo-distributed data
–  Minimize bandwidth cost
–  Handle fault-tolerance, sovereignty

•  Configure system to optimize a given ~stable recurring workload (set of DAGs)

Page 15

KEY TAKE-AWAY 1:

Geo-distributed analytics is a fun and industrially relevant new instance of classic DB problems

Page 16

OUR APPROACH

Page 17

Architecture

[Architecture diagram. Labels: End-users; End-user facing DB (handles OLTP); Local ETL; Hive, Mahout, MapReduce; Coordinator; Reporting pipeline; DAGs; Results; Workload Optimizer; logs; exec, repl policy; Data transfer optimization]

Page 18

Data transfer optimization: Trading CPU/storage for bandwidth

•  Runtime optimization that works irrespective of the computation

•  CPU, storage within DCs is cheap
•  Bandwidth crossing DCs is expensive
•  This is one way we trade CPU/storage for bandwidth reduction

Page 19

Data transfer optimization: Caching

•  We use aggressive caching: cache all intermediate output

•  If computation recurs:
–  recompute results
–  send diff(new results, old results)

•  Actually worsens CPU, storage use

•  But saves cross-DC bandwidth
–  all we care about

[Diagram: src holds r_old and r_new; dst holds r_old; src sends diff(r_new, r_old) to dst]
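The recompute-then-diff step can be sketched as follows. This is a minimal illustration, not the prototype's code: it assumes results are sets of rows (no duplicates), and the names `Site`, `diff`, `apply_diff`, and `transfer` are hypothetical.

```python
def diff(new_rows: frozenset, old_rows: frozenset) -> dict:
    """Delta that turns old_rows into new_rows (set semantics assumed)."""
    return {"added": new_rows - old_rows, "removed": old_rows - new_rows}


def apply_diff(old_rows: frozenset, delta: dict) -> frozenset:
    """Rebuild the new result at the destination from its cached copy plus the delta."""
    return (old_rows - delta["removed"]) | delta["added"]


class Site:
    """One end of a transfer; caches the last result seen for each task."""

    def __init__(self) -> None:
        self.cache: dict[str, frozenset] = {}


def transfer(task_id: str, new_rows: frozenset, src: Site, dst: Site) -> int:
    """Ship only the delta from src to dst; return the number of rows sent."""
    delta = diff(new_rows, src.cache.get(task_id, frozenset()))
    src.cache[task_id] = new_rows
    dst.cache[task_id] = apply_diff(dst.cache.get(task_id, frozenset()), delta)
    return len(delta["added"]) + len(delta["removed"])


src, dst = Site(), Site()
first = transfer("daily_report", frozenset({("a", 1), ("b", 2), ("c", 3)}), src, dst)
second = transfer("daily_report", frozenset({("a", 1), ("b", 2), ("c", 4)}), src, dst)
print(first, second)  # 3 2
```

The second run still recomputes the full result at the source (more CPU and storage), but only the two changed rows cross the WAN.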

Page 20

Data transfer optimization: Caching

•  Caching naturally helps if one DAG arrives repeatedly (intra-DAG)

•  But interestingly: also helps inter-DAG
–  When multiple DAGs share common sub-operations
–  (Because we cache all intermediate output)

•  E.g. TPC-CH
–  5.99x for a part of the workload
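A rough sketch of why caching every intermediate output also helps across DAGs: if the cache is keyed by the sub-operation and its inputs rather than by the DAG, a second DAG that contains the same sub-operation finds its output already at the destination. The class and key scheme below are assumptions made for illustration, and the sketch uses set semantics for rows.

```python
class TransferCache:
    """Destination-side cache of intermediate outputs, keyed by sub-operation signature."""

    def __init__(self) -> None:
        self.cached: dict[tuple, frozenset] = {}

    def send(self, op_name: str, inputs: tuple, rows: frozenset) -> int:
        """Record rows as the latest output of the sub-operation; return rows shipped."""
        signature = (op_name, inputs)
        old = self.cached.get(signature, frozenset())
        self.cached[signature] = rows
        return len(rows ^ old)  # only the symmetric difference crosses the WAN


cache = TransferCache()
filtered = frozenset({("order", 1), ("order", 2), ("order", 3)})

# DAG A ships the output of a filter step.
sent_a = cache.send("filter_valid_orders", ("orders",), filtered)
# DAG B contains the same filter step; its output is already cached at the destination.
sent_b = cache.send("filter_valid_orders", ("orders",), filtered)
print(sent_a, sent_b)  # 3 0
```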

Page 21

Data transfer optimization: Caching ≈ View maintenance

•  Caching is a low-level, mechanical form of (materialized) view maintenance

+  Works for arbitrary computation
-  Compared to relational view maintenance
   •  Is less efficient (CPU, storage)
   •  Misses some opportunities

Page 22

KEY TAKE-AWAY 2:

The extreme ratio of bandwidth to CPU/storage allows for novel optimizations

Page 23

WORKLOAD OPTIMIZER


Page 26

Robust evolutionary approach

•  Start by supporting existing “centralized” plan
•  Continuous adaptation (loop):
–  Come up with a set of alternative hypotheses
–  Measure their costs using pseudo-distributed execution
   •  Novel mechanism with zero bandwidth-cost overhead
–  Compute new best plan
   •  Execution strategy
   •  Data replication strategy
–  Deploy new best plan

[Slide annotation: today (for rest see paper)]
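The adaptation loop can be sketched in a few lines. Everything here is a stand-in: `propose` and `measure_cost` are hypothetical callbacks (the slides do not show how hypotheses are generated or how pseudo-distributed execution measures cost), and the toy plan and cost function exist only to make the loop runnable.

```python
from typing import Callable, Iterable, TypeVar

Plan = TypeVar("Plan")


def adapt(
    initial_plan: Plan,
    propose: Callable[[Plan], Iterable[Plan]],
    measure_cost: Callable[[Plan], float],
    rounds: int,
) -> Plan:
    """Repeatedly propose alternatives, measure them, and keep the cheapest plan."""
    best = initial_plan
    for _ in range(rounds):
        candidates = [best, *propose(best)]        # hypotheses for this round
        best = min(candidates, key=measure_cost)   # compute new best plan
        # Deploying `best` would happen here in a real system.
    return best


# Toy stand-ins: a plan is the fraction of work pushed down to the edge DCs,
# and cost falls as more work is pushed down. 0.0 is the "centralized" plan.
def propose(plan: float) -> list[float]:
    return [min(1.0, plan + 0.25)]


def measure_cost(plan: float) -> float:
    return 1.0 - plan


print(adapt(0.0, propose, measure_cost, rounds=5))  # 1.0
```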

Page 27

Optimizing execution: Subproblem definition

•  Given:
–  Core workload: a set of recurrent DAGs
–  Sovereignty, fault-tolerance requirements

•  Need to decide best choice of:
–  Strategy for each task (e.g. hash join vs semi join)
–  Which task goes to which DC
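To see why the per-task strategy matters for bandwidth, the sketch below compares two ways of joining tables held in different data centers: copying one table across and joining locally (for instance with a hash join), versus a semi-join that ships only join keys one way and matching rows back. It is a simplified illustration that counts rows rather than bytes; the table contents and function names are made up.

```python
def ship_whole_table_cost(left_rows: list[tuple]) -> int:
    """Rows crossing the WAN if the whole left table is copied to the right table's DC."""
    return len(left_rows)


def semi_join_cost(left_rows: list[tuple], right_rows: list[tuple]) -> int:
    """Rows crossing the WAN for a semi-join on the first column:
    send the right table's distinct keys over, then send matching left rows back."""
    right_keys = {row[0] for row in right_rows}
    matching_left = [row for row in left_rows if row[0] in right_keys]
    return len(right_keys) + len(matching_left)


# Hypothetical data: a large table at one DC and a small one at another.
left = [(k, f"click-{k}") for k in range(1000)]
right = [(k, f"ad-{k}") for k in (3, 7, 7, 42)]

print(ship_whole_table_cost(left))   # 1000
print(semi_join_cost(left, right))   # 6
```

When most rows match, the semi-join saves little, which is why the choice has to be made per task.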

Page 28

Optimizing execution: Difficulties

1.  Optimizing even one task in isolation is very hard

2.  Should jointly optimize all tasks in each DAG

3.  Should jointly optimize all DAGs in workload
–  Caching

4.  Sovereignty, fault-tolerance

Page 29

Optimizing execution: Difficulties

1.  Optimizing even one task in isolation is very hard

[Diagram. DAG: a single task over inputs P and Q. Data: P and Q are partitioned across data centers, with DC1 holding P1 and Q1, and DCn holding Pn and Qn.]

Page 30

Optimizing execution: Difficulties

1.  Optimizing even one task in isolation is very hard

2.  Should jointly optimize all tasks in each DAG

3.  Should jointly optimize all DAGs in workload
–  Recall: caching helps when DAGs share sub-operations

4.  Sovereignty, fault-tolerance

Page 31

Optimizing execution: Greedy heuristic

•  Process all DAGs in parallel, separately. In each DAG:
–  Go over tasks in topological order
–  For each task, greedily pick lowest-cost available strategy
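The heuristic for a single DAG can be sketched as below. The strategy names and the cost table are invented for illustration; in particular, the `cost` callback is a placeholder for whatever measured or estimated bandwidth cost the optimizer has available.

```python
from graphlib import TopologicalSorter
from typing import Callable


def greedy_plan(
    dag: dict[str, set[str]],
    strategies: dict[str, list[str]],
    cost: Callable[[str, str, dict[str, str]], float],
) -> dict[str, str]:
    """Pick a strategy per task. `dag` maps each task to the tasks it depends on."""
    chosen: dict[str, str] = {}
    for task in TopologicalSorter(dag).static_order():
        # Upstream choices are already fixed, so cost() may consult `chosen`.
        chosen[task] = min(strategies[task], key=lambda s: cost(task, s, chosen))
    return chosen


# Hypothetical three-task DAG with made-up costs (TB/day of cross-DC transfer).
dag = {"preprocess": set(), "join": {"preprocess"}, "kmeans": {"join"}}
strategies = {
    "preprocess": ["push_down", "centralize"],
    "join": ["semi_join", "ship_all"],
    "kmeans": ["centralized"],
}
toy_cost = {
    ("preprocess", "push_down"): 0.0, ("preprocess", "centralize"): 10.0,
    ("join", "semi_join"): 0.03, ("join", "ship_all"): 2.0,
    ("kmeans", "centralized"): 0.0,
}
plan = greedy_plan(dag, strategies, lambda t, s, chosen: toy_cost[(t, s)])
print(plan)  # {'preprocess': 'push_down', 'join': 'semi_join', 'kmeans': 'centralized'}
```

Because each DAG is handled separately, a workload would simply call `greedy_plan` once per DAG.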

Page 32

When does the greedy heuristic work?

•  Contractive DAGs: picks optimal strategy
–  make up 98% of DAGs in our experiments

[Figure: data size along two example DAGs: filter → aggr → summarize, and extract features → combine]

Page 33

When does the greedy heuristic work?

•  Contractive DAGs: picks optimal strategy [98%]
•  DAGs that expand then contract: may not [2%]

[Figure: data size along two example DAGs: filter → aggr → summarize, and extract features → combine]
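The slides do not define "contractive" formally. One plausible reading, assumed here, is that no task emits more data than it reads. Under that assumption a check might look like this; the data sizes are invented.

```python
def is_contractive(dag: dict[str, set[str]], output_size: dict[str, float]) -> bool:
    """True if no task emits more data than the total size of its inputs.

    `dag` maps each task to its inputs; names missing from `dag` are base datasets.
    This definition is an assumption, not taken from the slides.
    """
    return all(
        output_size[task] <= sum(output_size[i] for i in inputs)
        for task, inputs in dag.items()
    )


# Made-up sizes (GB) for the two example shapes in the figure.
sizes = {"log": 100.0, "filter": 20.0, "aggr": 2.0, "summarize": 0.1,
         "extract_features": 400.0, "combine": 5.0}

chain = {"filter": {"log"}, "aggr": {"filter"}, "summarize": {"aggr"}}
expanding = {"extract_features": {"log"}, "combine": {"extract_features"}}

print(is_contractive(chain, sizes))      # True
print(is_contractive(expanding, sizes))  # False
```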

Page 34

Optimizing execution: Beyond the heuristic

•  Have a precise ILP formulation for special cases
–  SQL-only DAGs
–  MapReduce-only DAGs
–  (Handles fault-tolerance and sovereignty as constraints)
•  Alternate heuristics

•  General problem remains open

Page 35

KEY TAKE-AWAY 3:

The optimization space is massive, yet simple heuristics seem to yield good results

Page 36

EVALUATION

Page 37

Prototype: WANalytics

•  Implemented Hadoop-stack prototype
–  MapReduce, Hive, OpenNLP, Mahout, …

•  Experiments up to 10s of TBs scale
–  Real Microsoft production workload
–  Three standard synthetic benchmarks: BigBench, TPC-CH, Berkeley Big-Data
–  Mix of relational and non-relational

Page 38

Results: BigBench

[Plot, log-scale axes. x: size of OLTP updates since last OLAP run, TB (raw, uncompressed), 0.0001 to 10. y: data transfer, TB (compressed), 0.00001 to 10. Series: Centralized; Distributed: no caching; Distributed: with caching. Annotation: 330x]

Page 39

Results: TPC-CH

[Plot, log-scale axes. x: size of OLTP updates since last OLAP run, TB (raw, uncompressed), 0.0001 to 10. y: data transfer, TB (compressed), 0.00001 to 10. Series: Centralized; Distributed: no caching; Distributed: with caching. Annotation: 360x]

Page 40

Results: Microsoft production workload

[Plot. x: size of OLTP updates since last OLAP run. y: data transfer. Series: Centralized; Distributed: no caching; Distributed: with caching. Annotation: 257x]

Page 41

Results: Berkeley Big-Data

[Plot, log-scale axes. x: size of OLTP updates since last OLAP run, TB (raw, uncompressed), 0.0001 to 0.1. y: data transfer, TB (compressed), 0.00001 to 1. Series: Centralized; Distributed: no caching; Distributed: with caching. Annotation: 3.5x]

Page 42

KEY TAKE-AWAY 4:

The opportunity here is substantial: more than two orders of magnitude in many workloads

Page 43

OPEN PROBLEMS

Page 44

Open Problems

•  Evolve optimizer beyond greedy
•  Even more general computational models
–  e.g. iteration

•  Latency
•  Consistency
•  Sovereignty / privacy


Page 46

Sovereignty: Partial support

•  Our system respects “data-at-rest” regulations (e.g., German data should not be stored outside of Germany)

•  But we allow arbitrary queries on the data
•  Limitation: we don’t differentiate between
–  Acceptable queries, e.g. “what’s the total revenue from each city”
–  Problematic queries, e.g. SELECT * FROM Germany
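A data-at-rest rule of this kind can be expressed as a check on a replication policy. The sketch below is illustrative only: the data center names, region codes, and the `violations` helper are hypothetical and do not come from the prototype.

```python
def violations(
    home_region: dict[str, str],
    replicas: dict[str, set[str]],
    dc_region: dict[str, str],
    restricted_regions: set[str],
) -> list[tuple[str, str]]:
    """Return (dataset, dc) pairs where a dataset from a restricted region
    is stored in a data center outside that region."""
    bad = []
    for dataset, dcs in replicas.items():
        region = home_region[dataset]
        if region not in restricted_regions:
            continue
        for dc in sorted(dcs):
            if dc_region[dc] != region:
                bad.append((dataset, dc))
    return bad


dc_region = {"dc-frankfurt": "DE", "dc-dublin": "IE", "dc-virginia": "US"}
home_region = {"orders_de": "DE", "orders_us": "US"}
replicas = {
    "orders_de": {"dc-frankfurt", "dc-dublin"},   # second copy leaves Germany
    "orders_us": {"dc-virginia", "dc-dublin"},
}
print(violations(home_region, replicas, dc_region, restricted_regions={"DE"}))
# [('orders_de', 'dc-dublin')]
```

A check like this only covers where data is stored. It says nothing about what a query returns, which is the limitation the slide describes.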

Page 47

Sovereignty: Partial support

•  Solution: either
–  Legally vet the core workload of queries/views
–  Use differential privacy mechanism

•  Open problem

Page 48

KEY TAKE-AWAY 5:

This is just the first step, lots of related work, lots of fun work ahead

Page 49

Related Work

•  Distributed and parallel databases
•  Single-DC frameworks (Hadoop/Spark/…)
•  Data warehouses
•  Scientific workflow systems
•  Sensor networks
•  Stream-processing systems
•  …


Page 51

Summary

•  Centralized analytics is becoming untenable
•  Proposal: geo-distributed analytics execution
•  WANalytics, our system, introduces
–  Pseudo-distributed measurement
–  Joint multi-query + redundancy optimization
–  Caching

•  On real and synthetic workloads: up to 360x less bandwidth than centralized

•  Many challenges remain

