xPAD: A Platform for Analytic Data Flows

Alkis Simitsis
HP Labs
Palo Alto, CA, USA
alkis@hp.com

Kevin Wilkinson
HP Labs
Palo Alto, CA, USA
[email protected]

Petar Jovanovic∗
Univ. Politècnica de Catalunya
Barcelona, Spain
[email protected]

ABSTRACT
As enterprises become more automated, real-time, and data-driven, they need to integrate new data sources and specialized processing engines. The traditional business intelligence architecture of Extract-Transform-Load (ETL) flows, followed by querying, reporting, and analytic operations, is being generalized to analytic data flows that utilize a variety of data types and operations. These complicated flows are difficult to design, implement and maintain since they span a variety of systems. Additionally, new design requirements may be imposed such as design for fault-tolerance, freshness, maintainability, sampling, etc. To reduce development time and maintenance costs, automation is needed. We present xPAD, our platform to manage analytic data flows. xPAD enables flow design. We show how these designs can be optimized, not just for performance, but for other objectives as well. xPAD is engine-agnostic. We show how it can generate executable code for a number of execution engines. It can also import existing flows from other engines and optimize those flows. In that way, it can transform a flow written for one engine into an optimized flow for a different engine. In our demonstration, we will also use various example flows to show optimization for different objectives and comparison of flow execution on different engines.

Categories and Subject Descriptors
H.4.m [Information Systems Applications]: Miscellaneous

Keywords
Analytics, Data Flows, Optimization, Code Generation

∗Work done while with HP Labs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD’13, June 22–27, 2013, New York, New York, USA. Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

1. INTRODUCTION
The trend in today’s large enterprises is away from a one-size-fits-all centralized data repository. Business analysis now employs a variety of processing engines, each suited to different data types and analysis tasks. Analytic data flows are becoming longer and more complex and include new design requirements like freshness, fault-tolerance, accuracy, and so on. At the same time, IT departments are under pressure to provide faster turn-around on answers to business questions yet be more productive at lower costs. Currently, analytic data flows are designed and implemented manually. This is time consuming and labor intensive. To address this, we demonstrate a platform to automate part of this process. It transforms a flow design into a more efficient design that is optimized for a number of objectives. It generates executable code for different processing engines. By analogy, programmers no longer write in assembly code. Modern languages and optimizers enable them to write at a higher level and still obtain nearly as efficient code. We believe the same approach will benefit business analytics.

Our demonstration presents xPAD, a cross-engine (x) Platform for managing Analytic Data flows. We highlight two of its capabilities. First, starting with a logical specification of a flow, we show how xPAD generates different executable forms for that flow based on user-specified optimization objectives. For example, a flow optimized for fault tolerance will differ from that same flow optimized for latency. Second, we show how xPAD can transform a flow written for one execution engine into a flow for another execution engine, where it may have better performance. For example, given a flow written in PigLatin to run on Hadoop, we can generate SQL code for a database or an executable flow for an ETL engine.

In the remainder of this paper, we first provide an overview of the xPAD architecture. We then provide some details on the features to be demonstrated. Finally, we outline our audience presentation.

2. SYSTEM OVERVIEW
We developed xPAD to manage analytic data flows. It presents a unified interface for applications to create and execute analytic data flows over a diverse collection of data stores and processing engines. xPAD comprises three main components (see Figure 1): API, Optimizer, and Code Dispatcher. These components communicate using an internal flow representation, called xLM, that captures structural information, requirements, operator properties (e.g., type, schemata, statistics, engine and implementation details, physical characteristics like memory budget), and so on.

[Figure 1: Analytics engine architecture. Labeled components: API (Design Editor, Flow Parser, xLM Converter); Optimizer (xLM Parser, State Generator, Trans. & Obj. Lib, Ops & Code Lib, Cost Estimator); Code Dispatcher (xLM Converter, Code Generator, Engine Plugins); xLM Layer.]

The API enables users to define logical flows through a GUI or to import flows from a variety of external sources. The flow specification is a high-level, logical design. Operators need not be bound to any particular implementation or engine and data sets need not be bound to any storage repository. Our GUI design editor extends the open-source ETL tool, Pentaho PDI (a.k.a. Kettle) [2]. We modified PDI to import and export flows as xLM or in other formats. A flow can be created using PDI. Like other design tools, PDI provides a design canvas where flow operators are dragged-and-dropped from a palette to a canvas. Alternatively, flows may be imported from other sources like workflow or other ETL engines or from scripts written in languages like SQL, PigLatin, Hive, etc. Regardless of how it is created, a flow is parsed (Flow Parser) and encoded in xLM (xLM Converter).

"© ACM, (2013). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is published in SIGMOD 2013, http://doi.acm.org/10.1145/2463676.2465247"

[Figure 3: Example renditions of the flow depicted in Figure 2. (a) PDI implementation; (b) SQL implementation; (c) Apache PigLatin implementation.]

The Optimizer compiles the flow into an optimized form. It treats the flow as a directed acyclic graph. Given a graph and a set of objectives, the Optimizer generates an alternative, functionally equivalent graph that satisfies the objectives [5, 6]. The objectives are encoded as objective functions stored in a library (Obj. Lib). These are plugins and may be easily extended. Based on the desired objectives, the appropriate optimization strategies (i.e., transitions stored in Trans. Lib) are applied to generate a state space of alternative graphs (State Generator) (see Figure 8(b)). The Optimizer connects to an operator library (Ops. Lib.). Each operator has a generic (logical) type and one or more physical implementations. Each implementation has an associated cost function and code template. The costs of individual operators in the flow are combined to compute the overall cost of a complete flow graph. That cost is used by heuristics to prune the search space of alternative graphs and to find the best solution for the objectives. For each operator implementation in the optimized flow, the Code Generator instantiates its code template in order to generate the executable code for a specific engine. A code template may be written in a number of languages (e.g., SQL, PigLatin, Java).
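To make the relationship between logical operators, physical implementations, cost functions, and code templates concrete, the following is a minimal Python sketch. It is not xPAD code: the `Implementation` class, the `OPS_LIB` dictionary, the cost coefficients, and the assumption that a flow's cost is the plain sum of its operator costs are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, Dict, List, Tuple


@dataclass
class Implementation:
    """One physical implementation of a logical operator (hypothetical structure)."""
    cost: Callable[[float], float]  # input cardinality -> estimated cost
    template: Template              # code template with $-placeholders


# Hypothetical operator library: logical type -> engine -> implementation.
# The cost coefficients are made-up numbers for illustration only.
OPS_LIB: Dict[str, Dict[str, Implementation]] = {
    "filter": {
        "sql": Implementation(lambda n: 0.001 * n,
                              Template("SELECT * FROM $src WHERE $pred")),
        "pig": Implementation(lambda n: 0.004 * n,
                              Template("$out = FILTER $src BY $pred;")),
    },
}

# A flow is a list of (logical operator type, template parameters).
Flow = List[Tuple[str, Dict[str, str]]]


def flow_cost(flow: Flow, engine: str, cardinality: float) -> float:
    """Combine operator costs into a flow cost (assumed here to be a plain sum)."""
    return sum(OPS_LIB[op][engine].cost(cardinality) for op, _ in flow)


def generate_code(flow: Flow, engine: str) -> List[str]:
    """Instantiate the code template of each operator for the chosen engine."""
    return [OPS_LIB[op][engine].template.substitute(params) for op, params in flow]


flow: Flow = [("filter", {"src": "sales", "out": "big_sales", "pred": "amount > 100"})]
for engine in ("sql", "pig"):
    print(engine, flow_cost(flow, engine, 1000), generate_code(flow, engine))
```

The same logical flow yields different code and a different cost estimate per engine, which is the information an optimizer needs in order to compare alternatives.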

Note that an execution engine may further optimize a flow fragment for performance using its own optimizer. In fact, we expect an engine to do so and to do a better job than xPAD, which has little knowledge of the engine internals. However, we view an engine’s internal optimization as complementary to what xPAD is doing.

3. DEMONSTRABLE FEATURES OF xPAD
This section describes the xPAD features we will demonstrate. We discuss, in turn, relevant aspects of the interface, optimizer, and code generation. To facilitate the presentation, we use an example analytic flow that integrates structured and unstructured data. The flow produces a report to assess the effectiveness of a product marketing campaign. The report combines sales data for a product marketing campaign with sentiments about that product gleaned from tweets crawled from the Web. The report lists total sales and average sentiment for each day of the campaign. Campaigns promote a specific product and are targeted at non-overlapping, geographical regions. To simplify the presentation, we assume the sentiment analysis of a tweet yields a single metric, i.e., like or dislike the product over a range of -5 to +5. Figure 2 illustrates a logical flow for the report.

[Figure 2: An example data flow. Node labels: tweet, sales, products, region, campaign, report, sentiment analysis, lookup (twice), extrDay, rollup (twice), filter, join (twice). Edge annotations: tweet, {tag, sentiment}; pID: f1(tag); region: f2(tweet); day: f3(tweet); pID, region, day; pID, region, dayBeg, dayEnd; day: f4(dayBeg, dayEnd).]
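As a rough illustration of what the example flow computes, the sketch below reproduces the rollup, filter, and join steps over in-memory Python records. It is only an approximation under stated assumptions: tweets are assumed to have already passed through sentiment analysis and the product and region lookups, days are plain integers, and the field names `amount`, `totalSales`, and `avgSentiment` are hypothetical.

```python
from collections import defaultdict


def rollup(rows, keys, value, agg):
    """Group rows by the key fields and aggregate one value field."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {key: agg(values) for key, values in groups.items()}


def campaign_report(sales, tweets, campaigns):
    """Join per-day sales totals and average sentiment, restricted to campaign days."""
    keys = ("pID", "region", "day")
    total_sales = rollup(sales, keys, "amount", sum)
    avg_sentiment = rollup(tweets, keys, "sentiment", lambda v: sum(v) / len(v))
    report = []
    for c in campaigns:
        for (pid, region, day), total in sorted(total_sales.items()):
            in_campaign = (pid, region) == (c["pID"], c["region"])
            if in_campaign and c["dayBeg"] <= day <= c["dayEnd"]:
                report.append({"pID": pid, "region": region, "day": day,
                               "totalSales": total,
                               "avgSentiment": avg_sentiment.get((pid, region, day))})
    return report


# Tiny made-up inputs, for illustration only.
sales = [
    {"pID": "p1", "region": "EU", "day": 1, "amount": 10},
    {"pID": "p1", "region": "EU", "day": 1, "amount": 5},
    {"pID": "p1", "region": "EU", "day": 2, "amount": 7},
    {"pID": "p1", "region": "EU", "day": 9, "amount": 3},  # outside the campaign
]
tweets = [
    {"pID": "p1", "region": "EU", "day": 1, "sentiment": 4},
    {"pID": "p1", "region": "EU", "day": 1, "sentiment": -2},
    {"pID": "p1", "region": "EU", "day": 2, "sentiment": 5},
]
campaigns = [{"pID": "p1", "region": "EU", "dayBeg": 1, "dayEnd": 2}]

report = campaign_report(sales, tweets, campaigns)
for line in report:
    print(line)
```

In this sketch, days that have tweets but no sales are dropped from the report; whether the actual flow behaves that way depends on the join type, which the text does not specify.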

Design of an analytic flow. As discussed, an xPAD flow can be created using the xPAD GUI design tool or can be imported from some other design tool or script. Figure 3 illustrates renditions of our example flow in three forms. The leftmost implementation shows the flow as it appears in PDI. The center and rightmost show the flow expressed in SQL and PigLatin, respectively. One of our changes to PDI was to add a menu tab that enables import, optimization, and export of flows, either as xLM or as executable code for another engine, as we describe shortly (see Figure 4). Importing either the SQL or PigLatin flow into xPAD will produce a flow that appears in our GUI like Figure 3(a). Note that this is a logical flow intended for exposition. It is not yet optimized.

For translating a flow to xLM, we parse the input flow, identify operators and data stores, and map them to a library of operators supported by xPAD. We support a fairly large number of operators (more than a hundred) and it is relatively easy to register a new one using the xPAD wizard for modifying and registering operators. For this task, and for the other administrative functions shown next, we also provide an administration console. For example, Figure 5 has a partial list of the supported operators (left) and attributes of a selected operator (right). The attributes include functions to compute cardinality and processing cost, templates for producing execution code for different engines, and properties of the operator (e.g., blocking, order-preserving). For example, Figure 5 shows a SelectRows operator with code template tabs for SQL/Vertica and Hadoop/PigLatin. Other operators may have templates for different SQL engines or languages, different costs and properties. The templates and cost functions are encoded in JavaScript. If an operator is added with incomplete attributes, defaults are used so that optimization is still possible, though sub-optimal. To facilitate working with functions, xPAD includes a handy code editor where one may write and test JavaScript functions (see Figure 6; top: code testbed, bottom: console reporting results of code execution).

[Figure 4: PDI menu tab for xPAD]

[Figure 5: Registering new operators or modifying existing ones]

[Figure 7: Statistics collection]

[Table 1: Alternative solutions for various objectives]

Accurate cost models are critical to the effectiveness of the optimizer. Initially, the cost function for an operator is based on extensive micro-benchmarks performed during the calibration of the optimizer [3]. However, a user may augment these offline results with statistics collected when a flow is executed. The user may decide to collect statistics of the flow over a sample of its input data. For example, in our GUI, we provide the user with the option to collect statistics, as shown in Figure 7. We offer two options: using either reservoir sampling (seed and sampling size are tunable parameters) or uniform sampling (the sampling size can be tuned). We use standard techniques for propagating the sampling to the input datasets of the flow (e.g., as in [1, 8]), but we have taken care to implement these methods as plugins to our system, so one may choose a different sampling method.
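For readers unfamiliar with reservoir sampling, the following Python sketch shows the classic single-pass variant (often called Algorithm R) with the two tunable parameters mentioned above, seed and sample size. It is a textbook illustration, not the sampling plugin used in xPAD; the function name and signature are assumptions.

```python
import random


def reservoir_sample(rows, size, seed):
    """Return `size` rows chosen uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)       # a fixed seed makes the sample reproducible
    reservoir = []
    for i, row in enumerate(rows):
        if i < size:
            reservoir.append(row)   # fill the reservoir with the first `size` rows
        else:
            j = rng.randint(0, i)   # inclusive bounds: row i survives with prob. size/(i+1)
            if j < size:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(range(10_000), size=100, seed=42)
print(len(sample))
```

Because the input is consumed as a stream, the sample can be drawn without knowing the input size in advance, which suits data sets whose cardinality is not yet known when the flow starts.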

Optimization for multiple objectives. One of the distinctive features of xPAD is that a flow can be optimized for more than one objective. We formulate optimization as a state space search problem. Starting from the initial flow we apply a series of transitions (like operator swap, flow parallelization or checkpoint), each producing an alternative, functionally equivalent flow. The user may use a default search strategy or choose among various algorithms for creating the state space. These range from fast methods based on heuristics to slow, exhaustive ones that can be used for experimentation and evaluation of the faster, non-exhaustive ones. Some of these methods and heuristics have been described elsewhere [4, 5, 6].
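The state space formulation can be sketched in a few lines of Python. The example below is a toy under loud assumptions: a flow is a linear sequence of operators rather than a graph, the only transition is a swap of adjacent operators, every swap is assumed to preserve the flow's semantics, and the per-row costs and selectivities in `OPS` are invented. It performs the slow, exhaustive style of search; the heuristic strategies are described in the cited papers and are not reproduced here.

```python
from collections import deque

# Hypothetical operators: name -> (cost per input row, selectivity).
OPS = {"sentiment": (5.0, 1.0), "lookup": (1.0, 1.0), "filter": (0.1, 0.2)}


def cost(flow, rows=1000.0):
    """Estimated cost of a linear flow: each operator pays per row it receives."""
    total = 0.0
    for op in flow:
        per_row, selectivity = OPS[op]
        total += per_row * rows
        rows *= selectivity
    return total


def swaps(flow):
    """Transition: swap two adjacent operators (assumed always legal in this toy)."""
    for i in range(len(flow) - 1):
        nxt = list(flow)
        nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
        yield tuple(nxt)


def exhaustive_search(initial):
    """Enumerate every state reachable through swaps and keep the cheapest one."""
    seen = {initial}
    queue = deque([initial])
    best = initial
    while queue:
        state = queue.popleft()
        if cost(state) < cost(best):
            best = state
        for nxt in swaps(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best, len(seen)


initial = ("sentiment", "lookup", "filter")
best, explored = exhaustive_search(initial)
print(initial, cost(initial))
print(best, cost(best), "states explored:", explored)
```

Moving the selective filter ahead of the expensive operators lowers the estimated cost. In a real flow, the legality of a swap depends on operator properties such as the schemata they read and write, which this sketch ignores.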

Figure 6: Code editor

The administration console enables users to interactively monitor the optimization process. Several functions are available and the optimizer can be extended and modified. Figures 5 and 6 show a few examples, but a large number of other parameters can be configured as well. A flow can be explored using the xPAD internal representation (see Figure 8(a)). For any flow in the state space, we can see details about that flow (with a mouse-right-click on a node) such as flow statistics, its encoding in xLM, and instantly produced execution code (discussed shortly).

Figure 8(b) illustrates a part of the state space for optimizing the flow of Figure 2 when performance is the objective. The green and red nodes indicate the initial and the optimal flows, respectively, and arcs between nodes indicate graph transitions. With a mouse-right-click on any state, the user may study it as in Figure 8(a). Statistics like the minimum cost state, time/memory used, states per transition, etc. are available. The path from the green to the red state can be graphically highlighted to show all transitions from initial to optimized flow.

The state space is expanded according to cost functions based on the given optimization objectives. Table 1 illustrates possible solutions for optimizing the example flow for four possible objectives: performance, in terms of the execution cost; maintainability, which can be measured as a combination of different metrics evaluating how vulnerable a flow is to a potential change or how easy it may be to absorb such a change; and recoverability, soft and hard, which is the ability of a flow to recover from failure without restarting from the beginning. In addition to the individual objectives, each flow has a global score, computed as a function of the flow behavior across all metrics (lower is better) and an assessment of the best use for the flow, if any. For our example, solution #441 (Figure 10) is optimal for performance but not necessarily optimal for recoverability or maintainability. Solution #951, the best for soft recoverability as the objective, is estimated to perform almost as well as #441 (which has more parallelism) with the extra benefit of having 3 recovery points. Thus, #951 has faster recovery so, in the presence of faults, it should have better performance than #441. This is reflected in their global scores that rate #951 slightly better than #441. For other objectives, like maintainability or hard recoverability, the table shows other solutions are preferred. For multiple objectives, xPAD may suggest more than one solution, each being best in a single objective, but it also suggests solutions that perform best across all objectives, like the last two, #348 and #1233, in the table.

[Figure 8: Screenshots of xPAD for the flow of Figure 2. (a) Flow representation in xPAD; (b) Example state space.]

[Figure 9: Impact of optimization on flow execution time. Bar chart; time axis in seconds (0 to 500); engines: sql-bl, sql-pip, mr, etl; series: orig, optim.]

[Figure 10: PDI implementation of the example flow optimized for performance]
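The difference between the best solution for a single objective and the best solution overall can be illustrated with a small Python sketch. Everything in it is hypothetical: the candidate names A, B, and C, their normalized metric values, and the use of an equally weighted sum as the global score are assumptions made for illustration and do not reproduce Table 1 or the scoring function used in xPAD.

```python
# Hypothetical candidate flows with normalized metrics (lower is better).
candidates = {
    "A": {"performance": 0.2, "recoverability": 0.9, "maintainability": 0.6},
    "B": {"performance": 0.3, "recoverability": 0.3, "maintainability": 0.5},
    "C": {"performance": 0.7, "recoverability": 0.5, "maintainability": 0.2},
}


def best_per_objective(cands):
    """For each objective, pick the candidate with the lowest metric value."""
    objectives = next(iter(cands.values())).keys()
    return {obj: min(cands, key=lambda name: cands[name][obj]) for obj in objectives}


def global_score(metrics, weights):
    """Assumed global score: a weighted sum of the metrics (lower is better)."""
    return sum(weights[obj] * value for obj, value in metrics.items())


weights = {"performance": 1 / 3, "recoverability": 1 / 3, "maintainability": 1 / 3}
winners = best_per_objective(candidates)
overall = min(candidates, key=lambda name: global_score(candidates[name], weights))
print(winners)
print("best global score:", overall)
```

With these made-up numbers, each of A, B, and C wins one objective, while B has the best global score because it is never far from the best in any of them.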

Code generation and execution. For any given flow in the search space, xPAD generates executable code for the engine of choice. Figure 8(a) shows code generation options (right pane) for a flow (left pane). Example options include a database engine (e.g., Vertica, PostgreSQL), a Map-Reduce engine like Hadoop (e.g., PigLatin, Hive), and an ETL engine (e.g., PDI). The benefits of optimization and cross-engine capabilities are illustrated in Figure 9. This figure shows the execution time of our example flow when executed on various engines in both original and optimized forms. (Note that this is an example measurement; the engines were configured on different systems, so it should not be used to compare the engines themselves.)

For some engine types, users may influence the style of generated code. Figure 8(a) shows a snippet of SQL code for Vertica, where a user may specify the nesting level of generated SQL over a range from highly nested (pipelined, sql-pip) to blocking (separate SQL statements connected through temporary tables, sql-bl). Varying the nesting level often enables fairer sharing of resources by decomposing a long flow into multiple smaller flows, and it also assists designers and analysts in debugging and understanding their flows better [7].
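The two ends of that range can be sketched as follows. This Python example is an assumption-laden illustration rather than xPAD's generator: the `generate_sql` function, the step templates, and the `tmpN` and `tN` naming are hypothetical, and the emitted statements are generic SQL that is not tuned to the temporary-table syntax of Vertica or any other engine.

```python
def generate_sql(source, steps, style):
    """Render a chain of SQL steps either as one nested query or as separate statements."""
    if style == "sql-pip":
        # Pipelined: each step reads from the previous step as a nested subquery.
        query = ""
        for i, step in enumerate(steps):
            src = source if i == 0 else f"({query}) AS t{i}"
            query = step.format(src=src)
        return [query]
    if style == "sql-bl":
        # Blocking: each step materializes its result in a temporary table.
        statements = []
        src = source
        for i, step in enumerate(steps, start=1):
            statements.append(f"CREATE TEMPORARY TABLE tmp{i} AS {step.format(src=src)}")
            src = f"tmp{i}"
        return statements
    raise ValueError(f"unknown style: {style}")


# Hypothetical two-step flow fragment: filter sales, then roll them up per day.
STEPS = [
    "SELECT pID, region, day, amount FROM {src} WHERE amount > 0",
    "SELECT pID, region, day, SUM(amount) AS total FROM {src} GROUP BY pID, region, day",
]

pipelined = generate_sql("sales", STEPS, "sql-pip")
blocking = generate_sql("sales", STEPS, "sql-bl")
for statement in pipelined + blocking:
    print(statement)
```

Intermediate nesting levels could be obtained by materializing only some of the steps, which is one way to read the "semi-blocking" style mentioned later in the paper.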

xPAD can also be used to convert a flow from one engine to another. For example, assuming that the input flow was the one depicted in Figure 3(a), then xPAD could produce any of the scripts of Figures 3(b) and 3(c), and vice versa.

4. OUR PRESENTATION
Our presentation script will utilize the marketing campaign example flow described in this paper. We will show how the flow can be imported from scripts in different languages and then optimized for different objectives. We will compare the flows visually and by executing them. We will also show the cross-engine capability of xPAD by generating executable code for our example flow for different execution engines.

For off-script presentation and discussion, we will provide interactivity, where the participants can browse example analytic scenarios and experiment with data flows already created and stored in our system. The participants can also modify these flows or create new flows from scratch. Depending on the participants’ interests, we can show (a) how new operators are added, including cost and cardinality functions; (b) how cost functions may be tuned by running a flow with sampling for statistics collection; (c) how to tune the optimizer by selecting different strategies; and (d) how we produce different styles of code (e.g., pipelined, semi-blocking, etc.). The participants may also review the internals of xPAD, for example, details of xLM, our internal encoding of flows, and so on. Finally, the participants can see the actual benefit and tradeoffs of the multi-objective optimization, by studying the results of the actual flow execution on different execution engines.

5. REFERENCES
[1] S. Chaudhuri, R. Motwani, and V. R. Narasayya. On random sampling over joins. In SIGMOD Conference, pages 263–274, 1999.
[2] Pentaho PDI. url: http://kettle.pentaho.com/ (version 4.3.0), 2012.
[3] A. Simitsis and K. Wilkinson. Revisiting ETL Benchmarking: The Case for Hybrid Flows. In TPCTC, 2012.
[4] A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal. QoX-driven ETL Design: Reducing the Cost of ETL Consulting Engagements. In SIGMOD, pages 953–960, 2009.
[5] A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal. Optimizing Analytic Data Flows for Multiple Execution Engines. In SIGMOD, pages 829–840, 2012.
[6] A. Simitsis, K. Wilkinson, U. Dayal, and M. Castellanos. Optimizing ETL Workflows for Fault-Tolerance. In ICDE, pages 385–396, 2010.
[7] A. Simitsis, K. Wilkinson, U. Dayal, and M. Hsu. HFMS: Managing the Lifecycle and Complexity of Hybrid Analytic Data Flows. In ICDE, 2013.
[8] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD Conference, pages 205–216, 2006.

