
What-if Analysis for Data Warehouse Evolution

George Papastefanatos1, Panos Vassiliadis2, Alkis Simitsis3, and Yannis Vassiliou1

1 National Technical University of Athens, Dept. of Electrical and Computer Eng., Athens, Hellas {gpapas,yv}@dbnet.ece.ntua.gr

2 University of Ioannina, Dept. of Computer Science, Ioannina, Hellas [email protected]

3 IBM Almaden Research Center, San Jose, California, USA [email protected]

Abstract. In this paper we discuss the problem of performing what-if analysis for changes that occur in the schema/structure of the data warehouse sources. We abstract software modules, queries, reports and views as (sequences of) queries in SQL enriched with functions. Queries and relations are uniformly modeled as a graph that is annotated with policies for the management of evolution events. Given a change at an element of the graph, our method detects the parts of the graph that are affected by this change and indicates the way they are tuned to respond to it.

1. Introduction

Data warehouses are complicated software environments where data stemming from operational sources are extracted, transformed, cleansed and eventually stored in fact or dimension tables in the data warehouse. Once this task has been successfully completed, further aggregations of the loaded data are also computed and stored in data marts, reports, spreadsheets, and other formats. The whole environment involves a very complicated architecture, where each module depends upon its data providers to fulfill its task. This strong flavor of inter-module dependency makes the problem of evolution very important in data warehouses.

Observe Fig.1, where a simplified version of a real-world ETL process is depicted. Data are extracted from sources, their contents and structure are modified, joined, new attributes are calculated via functions and the results are stored in fact tables and materialized views. Assume now that an attribute has to be deleted from the underlying database S1 or added to the base relation S2. Small changes like these might impact the whole workflow, possibly all the way to the warehouse (tables T1 and T2), along with any reports over the warehouse tables (abstracted as queries over view V3).


[Figure: an ETL summary spanning three zones (Sources, Data Staging Area (DSA), DW). Source relations S1 and S2 are loaded to DSA tables Tmp1 and Tmp2 through Load to DSA, Add field, Patch text fields and projection (π) activities; the two flows are joined and aggregated (γ) to populate warehouse tables T1 and T2 and materialized view V3.]

Fig. 1: A simple ETL workflow

Research has extensively dealt with the problem of schema evolution in object-oriented databases [1, 11, 15], ER diagrams [21], data warehouses [6, 16, 17, 18] and materialized views [2, 5, 6, 8]. Although several problems of evolution have been considered in the related literature, to the best of our knowledge, there is no global framework for the management of evolution in the described setting.

For example, assume that the warehouse designer wishes to add an attribute to the base relation S2. Should this change be propagated to the view or the query? Although related research can handle the deletion of attributes, due to the obvious fact that queries become syntactically incorrect, the addition of information is deferred to a decision of the designer. Similar considerations arise when the WHERE clause of a view is modified. Assume that the view definition is modified by incorporating an extra selection condition. Can we still use the view in order to answer existing queries (e.g., reports) that were already defined over the previous version of the view? The answer is not obvious, since it depends on whether the query uses the view simply as a macro (in order to avoid the extra coding effort) or, on the other hand, the query is supposed to work on the view, independently of what the view definition is [22]. The problem lies in the fact that there is no semantic difference in the way one defines the query over the view; i.e., we define the query in the same manner in both cases.

Our approach, in this paper, is to provide a general mechanism for performing what-if analysis [19] for potential changes of data source configurations. A graph model that uniformly models relations, queries, views, ETL activities and their significant properties (e.g., conditions) is introduced. Apart from the simple task of capturing the semantics of a database system, the graph model allows us to predict the impact of a change over the system. Furthermore, we provide a framework for annotating the database graph with policies concerning the behavior of nodes in the presence of hypothetical changes. Finally, rules that dictate the proper actions, when additions, deletions or updates are performed to relations, attributes and conditions (all treated as first-class citizens of the model) are provided. In other words, assuming that a graph construct is annotated with a policy for a particular event (e.g., an activity node is tuned to deny deletions of its provider attributes), the proposed framework (a) performs the identification of the affected subgraph and, (b) if the policy is appropriate, automates the readjustment of the graph to fit the new semantics imposed by the change. Finally, we experimentally assess our proposal.


Outline. Section 2 presents the graph model for databases. Section 3 proposes a framework of graph annotations and readjustment automation for database evolution. A case study of our framework is presented in Section 4. Section 5 discusses related work. Finally, Section 6 concludes and provides insights for future work.

2. Graph-based modeling of ETL processes

In this section, we propose a graph modeling technique that uniformly covers relational tables, views, ETL activities, database constraints and SQL queries as first class citizens. The proposed technique provides an overall picture not only for the actual source database schema but also for the ETL workflow, since queries that represent the functionality of the ETL activities are incorporated in the model.

The proposed modeling technique represents all the aforementioned database parts as a directed graph G=(V,E). The nodes of the graph represent the entities of our model, whereas the edges represent the relationships among these entities. Preliminary versions of this model appear in [9, 10].

The constructs that we consider are classified as elementary, comprising relations, conditions and queries, and composite, comprising views, ETL activities and ETL processes. Composite elements are combinations of elementary ones.

Relations, R. Each relation R(Ω1,Ω2,…,Ωn) in the database schema, either a table or a file (it can be considered as an external table), is represented as a directed graph, which comprises: (a) a relation node, R, representing the relation schema; (b) n attribute nodes, Ωi∈Ω, i=1..n, one for each of the attributes; and (c) n schema relationships, ES, directing from the relation node towards the attribute nodes, indicating that the attribute belongs to the relation.
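As an illustration, the relation subgraph just described can be sketched as a small dictionary-based graph. This is a sketch under our own representation assumptions, not the authors' implementation; the `DBGraph` class and the node/edge kind labels are hypothetical names mirroring the text.

```python
# Illustrative sketch: a relation R(A1,...,An) becomes a relation node,
# n attribute nodes, and n schema edges directed from relation to attribute.
from collections import defaultdict

class DBGraph:
    def __init__(self):
        self.nodes = {}                # node id -> node kind
        self.edges = defaultdict(set)  # source id -> {(target id, edge kind)}

    def add_relation(self, name, attributes):
        """Add a relation node plus one attribute node per attribute,
        connected by schema edges from the relation to each attribute."""
        self.nodes[name] = "relation"
        for attr in attributes:
            attr_id = f"{name}.{attr}"
            self.nodes[attr_id] = "attribute"
            self.edges[name].add((attr_id, "schema"))

g = DBGraph()
g.add_relation("EMP", ["Emp#", "Name"])
```

The same structure extends naturally to the condition, query and view nodes introduced below.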

Conditions, C. Conditions refer both to the selection conditions of queries and views and to the constraints of the database schema. We consider three classes of atomic conditions that are composed through the appropriate usage of an operator op belonging to the set of classic binary operators, Op (e.g., <, >, =, ≤, ≥, !=, IN, EXISTS, ANY): (a) Ω op constant; (b) Ω op Ω’; and (c) Ω op Q. (Ω, Ω’ are attributes of the underlying relations and Q is a query.)

A condition node is used for the representation of the condition. Graphically, the node is tagged with the respective operator and it is connected to the operand nodes of the conjunct clause through the respective operand relationships, O. Composite conditions are easily constructed by tagging the condition node with a Boolean operator (e.g., AND or OR) and connecting it, through the respective edges, to the conditions composing the composite condition.

Well-known constraints of database relations, i.e., primary/foreign key, unique, not null, and check constraints, are easily captured by this modeling technique. Foreign keys are represented as subset relationships between the source and the target attributes, and check constraints are simple value-based conditions. Primary keys, which are unique-value constraints, are explicitly represented through a dedicated node tagged by their name and a single operand node.

Queries, Q. The graph representation of a Select-Project-Join-Group-By (SPJG) query involves a new node representing the query, named query node, and attribute nodes corresponding to the schema of the query. The query graph is therefore a directed graph connecting the query node with all its schema attributes, via schema relationships. In order to represent the relationship between the query graph and the underlying relations, we resolve the query into its essential parts: SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY, each of which is eventually mapped to a subgraph.

Select part. Each query is assumed to own a schema that comprises the attributes, either with their original or alias names, appearing in the SELECT clause. In this context, the SELECT part of the query maps the respective attributes of the involved relations to the attributes of the query schema through map-select relationships, EM, directing from the query attributes towards the relation attributes.

From part. The FROM clause of a query can be regarded as the relationship between the query and the relations involved in this query. Thus, the relations included in the FROM part are combined with the query node through from relationships, EF, directing from the query node towards the relation nodes.

Where and Having parts. We assume that the WHERE and/or the HAVING clauses of a query are in conjunctive normal form. Thus, we introduce two directed edges, namely where relationships, Ew, and having relationships, EH, both starting from a query node towards an operator node corresponding to the conjunction of the highest level.

Concerning nested queries, we extend the WHERE subgraph of the outer query by: (a) constructing the respective graph for the subquery; (b) employing a separate operator node for the respective nesting operator (e.g., IN operator); and (c) employing two operand edges directing from the operator node towards the two operand nodes: the attribute of the outer query and the respective attribute of the inner query, in the same way that conditions are represented in simple queries.

Group and Order By part. For the representation of aggregate queries, we employ two special purpose nodes: (a) a new node denoted as GB∈GB, to capture the set of attributes acting as the aggregators; and (b) one node per aggregate function labeled with the name of the employed aggregate function; e.g., COUNT, SUM, MIN. For the aggregators, we use edges directing from the query node towards the GB node that are labeled <group-by>, indicating group-by relationships, EG. Then, the GB node is connected with each of the aggregators through an edge tagged also as <group-by>, directing from the GB node towards the respective attributes. These edges are additionally tagged according to the order of the aggregators; we use an identifier i to represent the i-th aggregator. Moreover, for every aggregated attribute in the query schema, there exists an edge directing from this attribute towards the aggregate function node as well as an edge from the function node towards the respective relation attribute. Both edges are labeled <map-select> and belong to EM, as these relationships indicate the mapping of the query attribute to the corresponding relation attribute through the aggregate function node.
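The subgraphs described above can be sketched for a small aggregate query, assuming a plain dict-of-sets edge store; the node identifiers and edge labels below mirror the text but are illustrative, not the paper's code.

```python
# Illustrative sketch of the graph for:
#   SELECT EMP.Emp#, SUM(WORKS.Hours) FROM EMP, WORKS ... GROUP BY EMP.Emp#
from collections import defaultdict

edges = defaultdict(set)

def add_edge(src, dst, kind):
    edges[src].add((dst, kind))

# FROM part: from edges directed from the query node to the relation nodes
add_edge("Q", "EMP", "from")
add_edge("Q", "WORKS", "from")
# SELECT part: query attributes map to relation attributes (map-select)
add_edge("Q", "Q.Emp#", "schema")
add_edge("Q.Emp#", "EMP.Emp#", "map-select")
# Aggregation: query attribute -> function node -> relation attribute,
# both edges labeled map-select
add_edge("Q", "Q.T_Hours", "schema")
add_edge("Q.T_Hours", "SUM", "map-select")
add_edge("SUM", "WORKS.Hours", "map-select")
# GROUP BY part: query -> GB node -> aggregators, tagged with their order
add_edge("Q", "GB", "group-by")
add_edge("GB", "EMP.Emp#", "group-by:1")   # "1" marks the first aggregator
```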

The representation of the ORDER BY clause of the query is performed similarly.

Functions, F. Functions used in queries are integrated in our model through a special purpose node Fi∈F, denoted with the name of the function. Each function has an input parameter list comprising attributes, constants, expressions, and nested functions, and one (or more) output parameter(s). The function node is connected with each input parameter graph construct (nodes for attributes and constants, or sub-graphs for expressions and nested functions) through an operand relationship directing from the function node towards the parameter graph construct. This edge is additionally tagged with an appropriate identifier i that represents the position of the parameter in the input parameter list. An output parameter node is connected with the function node through a directed edge E∈O∪EM∪EG∪EO from the output parameter towards the function node. This edge is tagged based on the context in which the function participates. For instance, a map-select relationship is used when the function participates in the SELECT clause, and an operand relationship for the case of the WHERE clause.

Views, V. Views are considered either as queries or relations (materialized views), thus, V ⊆ R∪Q.

ETL activities, A. An ETL activity is modeled as a sequence of SQL views. An ETL activity necessarily comprises: (a) one (or more) input view(s), populating the input of the activity with data coming from another activity or a relation; (b) an output view, over which the following activity will be defined; and (c) a sequence of views defined over the input and/or previous, internal activity views.

ETL summary, S. An ETL summary is a directed acyclic graph Gs=(Vs,Es) which acts as a zoomed-out variant of the full graph G [20]. Vs comprises the activities, relations and views that participate in an ETL process. Es comprises the edges that connect the providers and consumers. Conversely to the overall graph, where edges denote dependency, edges in the ETL summary denote data provision. The graph of the ETL summary can be topologically sorted and, therefore, execution priorities can be assigned to activities. Figure 1 depicts an ETL summary.
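Since the ETL summary is a DAG, the execution priorities mentioned above can be derived with a standard topological sort. The sketch below uses Kahn's algorithm over a toy summary loosely modeled on Fig. 1; the activity names are hypothetical.

```python
# Sketch: assign execution priorities to an ETL summary by topologically
# sorting its data-provision edges (provider -> consumer), Kahn-style.
from collections import deque

def topo_order(nodes, provides):
    indeg = {n: 0 for n in nodes}
    for src, dsts in provides.items():
        for d in dsts:
            indeg[d] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)  # sources first
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for d in provides.get(n, ()):
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return order

# Toy summary: two sources feed load activities, a join, and a fact table.
summary = {"S1": ["Load1"], "S2": ["Load2"],
           "Load1": ["Join"], "Load2": ["Join"], "Join": ["T1"]}
nodes = ["S1", "S2", "Load1", "Load2", "Join", "T1"]
priorities = topo_order(nodes, summary)
```

An activity is executed only after every provider earlier in `priorities` has run.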

Fig. 2. Graph of Aggregate Query [10]

Components. A component is a sub-graph of the graph in one of the following patterns: (a) a relation with its attributes and all its constraints; (b) a query with all its attributes, functions and operands. Components are disjoint and they are connected through edges concerning foreign keys, map-select, where, and so on.

Figure 2 depicts the proposed graph representation for the following aggregate query:

Q: SELECT EMP.Emp# as Emp#, Sum(WORKS.Hours) as T_Hours
   FROM EMP, WORKS
   WHERE EMP.Emp# = WORKS.Emp#
   GROUP BY EMP.Emp#

As far as modification queries are concerned, there is a straightforward way to incorporate them in the graph, too. In general, their behavior with respect to adaptation to changes in the database schema can be captured by SELECT queries. For lack of space, we simply mention that (a) INSERT statements can be treated as simple SELECT queries and (b) DELETE and UPDATE statements can also be treated as SELECT queries, possibly comprising a WHERE clause.
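This convention can be sketched as a tiny rewriter that maps DELETE and UPDATE statements to SELECT queries over the same relation and WHERE clause; the `as_select` helper is an illustrative assumption, not part of the paper.

```python
# Sketch: model DELETE/UPDATE statements as SELECT queries for
# evolution impact analysis, as described in the text.
import re

def as_select(stmt):
    """Return the SELECT query that stands in for a DELETE or UPDATE."""
    stmt = stmt.strip().rstrip(";")
    m = re.match(r"DELETE\s+FROM\s+(\w+)(\s+WHERE\s+.+)?$", stmt, re.I)
    if m:
        return f"SELECT * FROM {m.group(1)}{m.group(2) or ''}"
    m = re.match(r"UPDATE\s+(\w+)\s+SET\s+.+?(\s+WHERE\s+.+)?$", stmt, re.I)
    if m:
        return f"SELECT * FROM {m.group(1)}{m.group(2) or ''}"
    return stmt  # other statements are left to the caller

print(as_select("DELETE FROM EMP WHERE Emp# = 7"))
# SELECT * FROM EMP WHERE Emp# = 7
```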

3. Adapting ETL workflows for evolution of sources

In this section, we formulate a set of rules that allow the identification of the impact of changes to an ETL workflow and propose an automated way to respond to these changes. The impact of the changes affects the software used in an ETL workflow (mainly queries, stored procedures, triggers, etc.) in two ways: (a) syntactically, a change may cause a compilation or execution failure during the execution of a piece of code; and (b) semantically, a change may have an effect on the semantics of the software used.

The proposed rules annotate the graph representing the ETL workflow with actions that should be taken when a change event occurs. The combination of events and annotations determines the policy to be followed for the handling of a potential change. The annotated graph is stored in a metadata repository and accessed by a what-if analysis module. This module notifies the designer or the administrator of the effect of a potential change and of the extent to which the modification of the existing code can be fully automated, in order to adapt to the change.

3.1 The general framework for schema evolution

The main mechanism towards handling schema evolution is the annotation of the constructs of the graph (i.e., nodes and edges) with elements that facilitate what-if analysis. Each such construct is enriched with policies that allow the designer to specify the behavior of the annotated construct whenever events that alter the database graph occur. The combination of an event with a policy determined by the designer/administrator triggers the execution of the appropriate action that either blocks the event, or reshapes the graph to adapt to the proposed change.

The space of potential events comprises the Cartesian product of two subspaces: the space of hypothetical actions (addition, deletion, modification) and the space of graph constructs sustaining evolution changes (relations, attributes and conditions).

For each of the above events, the administrator annotates graph constructs affected by the event with policies that dictate the way they will regulate the change. Three kinds of policies are defined: (a) propagate the change, meaning that the graph must be reshaped to adjust to the new semantics incurred by the event; (b) block the change, meaning that we want to retain the old semantics of the graph and the hypothetical event must be blocked or, at least, constrained through some rewriting that preserves the old semantics [8, 13]; and (c) prompt the administrator to interactively decide what will eventually happen. For the case of blocking, the specific rewriting method is orthogonal to our approach and any available method can be used [8, 13].
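A minimal sketch of such annotations, assuming a simple key-value store of (construct, event) pairs with Prompt as the default policy; the construct and event names are illustrative.

```python
# Sketch: annotate graph constructs (nodes or edges) with one of the
# three policies per event type; unannotated constructs default to Prompt.
PROPAGATE, BLOCK, PROMPT = "propagate", "block", "prompt"

policies = {}  # (construct id, event type) -> policy

def annotate(construct, event, policy):
    policies[(construct, event)] = policy

def policy_for(construct, event):
    # Prompt is the default, for reasons of backwards compatibility.
    return policies.get((construct, event), PROMPT)

annotate("EMP", "add-attribute", PROPAGATE)          # node annotation
annotate("V.from.EMP", "add-attribute", BLOCK)       # edge annotation
```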

The definition of policies on each part of the system affected by a specific event involves the annotation of the respective construct (i.e., node or edge) in our graph framework. Table 1 presents the allowed annotations of graph constructs for each kind of event. The annotation is performed as follows. For the constructs belonging to database schemata (i.e. relations, attributes and condition nodes) we annotate the respective nodes, whereas for the constructs belonging to the queries/views affected by the schema change, we primarily annotate the edges connecting the respective nodes with the database nodes. Specifically, (a) we annotate the FROM edges connecting query nodes with relation nodes with policies defined on views/queries; (b) we annotate the map-select or group by edges with rules defined on query attribute nodes; and (c) we annotate operand edges with rules defined on query condition nodes.

Our framework prescribes the reaction of the parts of the system affected by a hypothetical schema change based on their annotation with policies. The correspondence between the examined schema changes and the parts of the system affected by each change is shown in Table 1.

[Table 1 summarizes, for each event on the database schema (the addition, deletion or modification/renaming of a relation/view R/V, an attribute Ω, or a condition C): the parts of the system affected (relations/views and queries/views, together with their attributes and conditions), the nodes that may be annotated with policies for that event (R, Ω, V, C), and the edges that may be so annotated (map-select, operand, from, group-by/order-by).]

Table 1: Parts of the system affected by each event and annotation of graph constructs with policies for each event

In Table 2 we present an overall picture of the framework. Potential events tested by the designer/administrator are depicted in the first column of the Table. The two rightmost columns depict the possible policies that the administrator could have set and the actions dictated by our framework. For each event, the candidate modules for change are also presented as well as the type of impact (i.e., semantic or syntactical) the change has on them.


Event on source schema: Add
- Ω (attribute added). Candidate modules for change: (1) queries/views that must include the added attribute in the SELECT clause; (2) queries/views with a SELECT * clause that must exclude the added attribute. Impact: semantic.
  Policy 1: include the attribute in the SELECT clause.
  Policy 2: rewrite the SELECT clause excluding the added attribute.
  Policy 3: one of the above.
- C (condition added). Candidate modules: queries/views referring to the relation/view over which the condition is added. Impact: semantic.
  Policy 1: leave the query intact.
  Policy 2: retain the old view (without the added condition); all queries with a block policy refer to the old view.
  Policy 3: one of the above.
- R/V (relation/view added). No direct impact.

Event on source schema: Delete
- Ω (attribute deleted). Candidate modules: queries/views referring to this attribute (i.e., in the SELECT clause, WHERE clause, etc.). Impact: syntactical, semantic.
  Policy 1: remove the deleted attribute from the query/view definition (i.e., SELECT, WHERE, GROUP BY clauses).
  Policy 2: rewrite the query/view properly in order to be valid.
  Policy 3: one of the above.
- C (condition deleted). Candidate modules: queries/views referring to the relation/view from which the condition is removed. Impact: semantic.
  Policy 1: leave the query intact.
  Policy 2: retain the old view (including the original condition); all queries with a block policy refer to the old view.
  Policy 3: one of the above.
- R/V (relation/view deleted). Candidate modules: queries/views referring to the relation/view. Impact: syntactical, semantic.
  Policy 1: remove the relation from the query/view definition (i.e., FROM clause) along with the attributes and conditions involving this relation (i.e., SELECT, WHERE, GROUP BY clauses).
  Policy 2: rewrite the query/view properly in order to be valid.
  Policy 3: one of the above.

Event on source schema: Modify/Rename
- Ω (attribute modified/renamed). Candidate modules: queries/views referring to this attribute (i.e., in the SELECT clause, WHERE clause, etc.). Impact: syntactical, semantic.
  Policy 1: rename the modified attribute in the query/view definition (i.e., SELECT, WHERE, GROUP BY clauses, etc.).
  Policy 2: rewrite the query/view properly in order to be valid.
  Policy 3: one of the above.
- C (condition modified). Candidate modules: queries/views referring to the relation/view of which the condition is modified. Impact: semantic.
  Policy 1: leave the query intact.
  Policy 2: retain the old view (including the original condition); all queries with a block policy refer to the old view.
  Policy 3: one of the above.
- R/V (relation/view renamed). Candidate modules: queries/views referring to the relation/view. Impact: syntactical, semantic.
  Policy 1: rename the relation in the query/view definition (i.e., FROM clause).
  Policy 2: rewrite the query/view properly in order to be valid.
  Policy 3: one of the above.

Policy types: 1 = Propagate, 2 = Block, 3 = Prompt (default).

Table 2: Actions determined by combinations of events and policies

The mechanism that determines the reaction to a change is formally described by the algorithm Propagate changeS (PS) in Figure 3. Given an ETL summary S over a graph Go and an event e, PS produces a new ETL summary Gn, which has absorbed the changes.


Algorithm Propagate changeS (PS)
Input: an ETL summary S over a graph Go=(Vo,Eo) and an event e
Output: a graph Gn=(Vn,En)
Variables: a set of events E, and an affected node A
Begin
  dps(S, Go, Gn, e, A)
End

dps(S, Gn, Go, E, A)
  I = Ins_by_policy(affected(E))
  D = Del_by_policy(affected(E))
  Gn = Go – D ∪ I
  E = E – e ∪ action(affected(E))
  if consumer(A) ≠ nil
    for each consumer(A)
      dps(S, Gn, Go, E, consumer(A))

Fig. 3. Algorithm Propagate changeS (PS)

Given an event altering the source database schema, PS determines the activity graphs that are affected and directly related to the altered source. Then, the changes are propagated to the internals of each module of the ETL summary, generating actions according to the policies enforced: propagate, block or prompt. According to the prevailing policy, the corresponding action dictated by Table 2 is taken (automatically, if possible) to adjust the affected constructs to the new schema. Subsequently, the initial changes, along with the readjustments caused by the respective actions, are recursively propagated as new events to the consumers of the activity graph.

Theorem 1. The algorithm PS terminates.

Proof. The algorithm operates over a finite graph. The only possibility for an infinite execution of the algorithm concerns cycles. The ETL summary is defined as a DAG; therefore, the only possibility for cycles lies within each module (e.g., joins of the form A=B∧B=C∧C=A). To handle this problem, we resort to a simple method: a unique session id is generated per event e. For each node affected by e, a log entry is kept. A default policy Prompt is raised whenever a cycle is detected within a query. The session ids are cleared at the end of each function call.
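The recursive propagation and the session-based cycle guard can be sketched as follows. This is a simplification, not the paper's PS: the Ins/Del_by_policy bookkeeping is reduced to recording (node, event, policy) actions, and on a detected cycle the sketch simply stops recursing, where PS raises the default Prompt policy.

```python
# Sketch of PS-style propagation: apply the event at a node, then
# recursively propagate it to the node's consumers, guarding against
# cycles with a per-event visited set (the "session log").
def propagate(node, event, consumers, policy_for, visited=None):
    """Record the action taken at `node`, then recurse to its consumers."""
    if visited is None:
        visited = set()              # fresh session log per event
    if node in visited:
        return []                    # cycle detected: stop recursing
    visited.add(node)
    actions = [(node, event, policy_for(node, event))]
    for consumer in consumers.get(node, ()):
        actions += propagate(consumer, event, consumers,
                             policy_for, visited)
    return actions

# Toy chain from Fig. 1: source S2 feeds a join that loads fact table T2.
consumers = {"S2": ["Join"], "Join": ["T2"]}
trace = propagate("S2", "add-attribute", consumers,
                  lambda n, e: "propagate")
```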

To better demonstrate the necessity and the functionality of our approach, consider the case of the addition of an attribute to a source relation. Then, the queries to which the addition must be reflected and propagated should be identified. Neither commercial database systems nor the research state of the art react to such a change; rather, they let the designer or the administrator manually propagate the change to the queries involved, a procedure that usually requires manual query rewriting. This treatment is mainly due to the fact that: (a) the addition of an attribute does not always syntactically affect the involved queries (i.e., the existing queries can still be executed without any problem); and (b) up to now, no mechanism has been proposed to inform the system that once an attribute is added to a relation, it must also be added to certain queries that access this particular relation.

Example. Consider the simple example query SELECT * FROM EMP as part of an ETL activity. Assume that provider relation EMP is extended with a new attribute PHONE. There are two possibilities:
- The * notation signifies the request for any attribute present in the schema of relation EMP. In this case, the * shortcut can be treated as "return all the attributes that EMP has, independently of which these attributes are". Then, the query must also retrieve the new attribute PHONE.
- The * notation acts as a macro for the particular attributes that the relation EMP originally had. In this case, the addition to relation EMP must not be further propagated to the query.

A naive reaction to a modification of the sources, e.g., the addition of an attribute, would be for an impact prediction system to trace all queries and views that are potentially affected and ask the designer to decide which of them must be modified to incorporate the extra attribute. We can do better by extending the current modeling. For each element affected by the addition, we annotate its respective graph constructs (i.e., nodes, edges) with the policies mentioned before. According to the policy defined on each construct, the respective action is taken to correct the query.

Therefore, for the example event of an attribute addition, the policies defined on the query and the actions taken according to each policy are:
- Propagate attribute addition. When an attribute is added to a relation appearing in the FROM clause of the query, this addition must be reflected in the SELECT clause of the query.
- Block attribute addition. The query is immune to the change: an addition to the relation is ignored. In our example, the second case is assumed, i.e., the SELECT * clause must be rewritten to SELECT A1,…,An without the newly added attribute.
- Prompt. In this case (the default, for reasons of backwards compatibility) the designer or the administrator must handle the impact of the change manually, in the same way that currently happens in database systems.
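For this example, the two automated reactions and the prompt fallback can be sketched as follows; the `react` helper and its string rewriting are illustrative assumptions, not the paper's code.

```python
# Sketch: react to the addition of attribute PHONE for the query
# SELECT * FROM EMP, under each of the three policies described above.
def react(attrs_before, new_attr, policy):
    if policy == "propagate":      # * keeps meaning "all attributes"
        select_list = attrs_before + [new_attr]
    elif policy == "block":        # * was a macro for the old schema
        select_list = list(attrs_before)
    else:                          # prompt: defer to the designer
        return None
    return "SELECT " + ", ".join(select_list) + " FROM EMP"

print(react(["Emp#", "Name"], "PHONE", "block"))
# SELECT Emp#, Name FROM EMP
```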

[Figure: the graph of query Q over relation EMP. A from edge connects Q to EMP and is annotated "on attribute addition then propagate"; schema (S) edges connect EMP to its attributes Emp#, Name and the newly added Phone, and map-select edges connect the query attributes to the corresponding relation attributes.]

Fig. 4: Propagating addition of attribute PHONE

The graph of the query SELECT * FROM EMP is shown in Figure 4. The annotation of the FROM edge as propagating addition indicates that the addition of PHONE node will be propagated to the query and the new attribute is included in the SELECT clause of the query. If a FROM edge is not tagged with this additional information, then a default case is assumed and the designer/administrator is prompted to decide.

3.2 Conflict resolution in the context of schema evolution

It is possible that the policies defined over the different elements of the graph do not always align towards the same goal. For instance, consider the case where a view V is defined over a database relation R and a query Q accesses the view V. Assume that R is annotated with the policy 'propagate', i.e., the modification of the graph is allowed when a new attribute is added:

ON attribute addition THEN propagate

whereas V is annotated with the policy 'block' in case an event, e.g., an attribute addition, occurs on the underlying relations:

ON attribute addition THEN block

Independently of what the relation's policy is, the propagation of the attribute addition must be blocked at the view level and not further propagated towards the query. Observe that there is no inconsistency here: originally, the designer configured the relation R to propagate its changes. Subsequently, the developers who constructed view V and query Q set the view V to retain its semantics, independently of what happens to R. At the same time, other views over R might be readjusted to reflect the new structure of R.

Fig. 5: Hierarchy for policy conflict resolution (modules ordered from stronger to weaker: query module, view module, relation module; within each module: conditions, attributes, and then the query/view/relation itself; the user's choice prevails over all policies)

The general guideline for handling policy conflicts follows this rule: the higher and the further to the left a module is in the hierarchy of Figure 5, the stronger its policy is. Specifically, we perform the following steps:

Step 1. Whenever a change takes place in a (lower-level) module: (a) the change is applied to the module according to its policy; (b) all (higher-level) modules of the graph that can possibly be affected by the change are determined; we name these modules candidates for readjustment; and (c) the change is propagated towards these candidates, or blocked.

Step 2. If an upper module does not have a policy to handle the propagated change, then it abides by the policy dictated by the lower level; e.g., if a view is modified and a query accessing it has no policy, then the query is aligned with the policy of the view.

Step 3. If an upper module has a policy of its own, then it overrides the policy dictated by the lower module. For example, if the addition of an attribute to a relation dictates the propagation of the change and a query accessing the relation has a block policy, then the query remains the same, independently of the relation's policy.

Step 4. For elements belonging to the same module, the above guidelines also hold. For example, if the policy for an attribute deletion defined on an attribute dictates that the change is blocked, and the policy for the same event defined on the relation dictates that the deletion is propagated, then the attribute's policy overrides the relation's policy.

Step 5. Iterate over the next module.

Assume, for example, the configuration of Figure 6. (Dotted lines mark the constructs whose policies determine what happens on attribute deletion.) This example involves a view, EmpsNR, characterizing employees near the age of retirement: SELECT * FROM EMP WHERE AGE>60. A query Q is defined over this view and returns employees near the age of retirement whose salary is high: SELECT * FROM EmpsNR WHERE SALARY>100K. Assume, now, that for the event of an attribute deletion, e.g., of the Salary attribute of EMP, the following policies have been defined on the graph:

- On the Salary attribute of relation EMP: ON attribute deletion THEN propagate. (The node of the Salary attribute is annotated.)

- On relation EMP: ON attribute deletion THEN block. (The node of relation EMP is annotated.)

- On view EmpsNR: ON attribute deletion THEN propagate. (The FROM edge between the view and relation EMP is annotated.)

- On query Q: ON attribute deletion THEN block. (The FROM edge between the query and view EmpsNR is annotated.)

The above policies are defined on different graph elements, but they capture the same event. Following the guidelines and the hierarchy of Figure 5, the policy conflict is resolved as follows:

Fig. 6: Example resolution of a conflict of policies (left: the graph of EMP, EmpsNR, and Q, annotated with the propagate/block policies; right: the graph after the deletion, where blocking at Q leaves dangling edges and an inconsistency)

- There is a conflict between the default policy of relation EMP, 'block deletions of EMP attributes', and the customized policy for attribute Salary, 'propagate'. Based on the aforementioned method, the customized attribute policy overrides the relation policy, and the attribute deletion is propagated to the relation. Attribute EMP.Salary is removed from the graph.

- The relation propagates the deletion to the view EmpsNR, whose 'propagate' policy raises no conflict with the event; the attribute EmpsNR.Salary is also removed from the graph.

- The query Q is notified of the attribute deletion and adjusts itself according to its policy: 'block'. The deletion is blocked from the subgraph of the query, with the result that the attribute Q.SALARY and the condition SALARY>100K have dangling edges. Assuming that a rewriting possibility exists [8, 13], the query can be rewritten; otherwise, the designer observes that the planned change results in a syntactic error and an inconsistent graph.
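The resolution steps above can be sketched as follows; the list-based provider-to-consumer chain and all names are hypothetical, and only the precedence logic (a module's own policy overrides the one dictated from below, and a block stops further propagation) mirrors the text:

```python
def effective_policy(own, inherited):
    """A module's own policy wins; otherwise it inherits (Steps 2-4)."""
    return own if own is not None else inherited

def resolve(chain, own_policies):
    """chain: modules from provider to consumer; own_policies: each
    module's own policy for the event, or None if unannotated.
    Returns the decision taken at each module reached by the change."""
    decisions = {}
    dictated = None
    for module, own in zip(chain, own_policies):
        decision = effective_policy(own, dictated)
        decisions[module] = decision
        dictated = decision        # the decision is dictated upwards
        if decision == "block":
            break                  # the change is not propagated further
    return decisions

# Fig. 6: Salary's 'propagate' overrides EMP's 'block' at the relation
# level; the view propagates; the query blocks the deletion.
d = resolve(["EMP.Salary", "EmpsNR", "Q"], ["propagate", "propagate", "block"])
```

Here `d` records 'propagate' for the attribute and the view and 'block' for the query, matching the walkthrough above; a module with no policy of its own simply inherits the decision dictated from below.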

4. Case Study

We have evaluated the effectiveness of our approach via the reverse engineering of real-world ETL processes, extracted from an application of the Greek public sector, and we have monitored the changes that took place at the sources of the studied data warehouse. In total, we have studied a set of 7 ETL processes that operate on the data warehouse. These processes extract information out of a set of 7 source tables, namely S1 to S7, and 3 lookup tables, namely L1 to L3, and load it into 9 tables, namely T1 to T9, stored in the data warehouse. The aforementioned scenarios comprise a total of 53 activities.

Source | Event Type        | Affected | Autom. adjusted | Prompt | %
-------+-------------------+----------+-----------------+--------+------
S1     | Rename            |    14    |       14        |   0    | 100%
S1     | Rename Attributes |    14    |       14        |   0    | 100%
S1     | Add Attributes    |    34    |       31        |   3    |  91%
S1     | Delete Attributes |    19    |       18        |   1    |  95%
S1     | Modify Attributes |    18    |       18        |   0    | 100%
S4     | Rename            |     4    |        4        |   0    | 100%
S4     | Rename Attributes |     4    |        4        |   0    | 100%
S4     | Add Attributes    |    19    |       15        |   4    |  79%
S4     | Delete Attributes |    14    |       12        |   2    |  86%
S4     | Modify Attributes |     6    |        6        |   0    | 100%
S2     | Rename            |     1    |        1        |   0    | 100%
S2     | Rename Attributes |     1    |        1        |   0    | 100%
S2     | Add Attributes    |     4    |        4        |   0    | 100%
S2     | Delete Attributes |     4    |        3        |   1    |  75%
S3     | Rename            |     1    |        1        |   0    | 100%
S3     | Rename Attributes |     1    |        1        |   0    | 100%
S5     | Modify Attributes |     3    |        3        |   0    | 100%
S6     | NO_CHANGES        |     0    |        0        |   0    |   -
S7     | Rename            |     1    |        1        |   0    | 100%
S7     | Rename Attributes |     1    |        1        |   0    | 100%
L1     | NO_CHANGES        |     0    |        0        |   0    |   -
L2     | Add Attributes    |     1    |        0        |   1    |   0%
L3     | Add Attributes    |     9    |        0        |   9    |   0%
L3     | Change PK         |     9    |        0        |   9    |   0%

Table 3: Analytic results (53 activities in total)

Table 3 illustrates the changes that occurred on the schemata of the source and lookup tables, such as renaming source tables, renaming attributes of source tables, adding and deleting attributes of source tables, modifying the domain of attributes, and, lastly, changing the primary key of lookup tables. After the application of these changes to the sources of the ETL processes, each affected activity was properly readjusted (i.e., the queries belonging to the activity were rewritten) in order to adapt to the changes. For each event, we counted: (a) the number of activities affected both semantically and syntactically; (b) the number of activities that were automatically adjusted by our framework (propagate or block policies); as opposed to (c) the number of activities that required the administrator's intervention (i.e., a prompt policy).
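As an illustration, the per-event automation percentages of Table 3 can be recomputed from the raw counts; the rows below copy the S1 entries of the table, and the tuple layout is an assumption for this sketch:

```python
# Recomputing the automation rate per evolution event, as in Table 3.
# Field order: (source, event, affected, auto_adjusted, prompt)
rows = [
    ("S1", "Rename",            14, 14, 0),
    ("S1", "Rename Attributes", 14, 14, 0),
    ("S1", "Add Attributes",    34, 31, 3),
    ("S1", "Delete Attributes", 19, 18, 1),
    ("S1", "Modify Attributes", 18, 18, 0),
]

for source, event, affected, auto, prompt in rows:
    assert affected == auto + prompt        # sanity: counts must add up
    pct = round(100 * auto / affected)
    print(f"{source} {event}: {pct}% automatically adjusted")
```

For instance, for the Add Attributes row this yields 91%, matching the table.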

Fig. 7: Evaluation of our method: (a) affected vs. automatically adapted activities per scenario size (number of activities); (b) affected vs. automatically adapted activities per activity complexity

Figure 7a depicts the correlation between the average number of affected activities and the number of automatically adapted activities with respect to the total number of activities contained in the ETL scenarios. For ETL processes comprising a small number of activities, most affected activities are successfully adjusted to evolution changes. For longer ETL processes, the number of automatically adjusted activities increases proportionally to the number of affected activities.

Furthermore, Table 3 shows that our framework can successfully handle and propagate evolution events to most activities, by annotating the queries included in them with policies. The activities requiring the administrator's intervention are mainly activities executing complex joins, e.g., with lookup tables, for which the administrator must decide upon the proper rewriting. Figure 7b presents the average number of automatically adapted activities with respect to the complexity of the activities. Complexity refers to the functionality of each activity, e.g., the type of transformation it performs, the types of queries it contains, etc. Simple activities are handled more effectively by our framework; more complex activities, such as pivoting activities, are also adequately adjusted by our approach to evolution changes.

5. Related Work

Evolution. A number of research works are related to the problem of database schema evolution. A survey on schema versioning and evolution is presented in [13], whereas a categorization of the overall issues regarding evolution and change in data management is presented in [12]. The problem of view adaptation after redefinition is mainly investigated in [2, 5, 6], where changes in a view's definition are invoked by the user and rewriting is used to keep the view consistent with the data sources. In [6] the authors discuss versioning of star schemata, where histories of the schema are retained and queries are chronologically adjusted to ask the correct schema version. [2] also deals with warehouse adaptation, but only for SPJ views. [8] deals with the view synchronization problem, which considers that views become invalid after schema changes in the underlying base relations. Our work in this paper builds mostly on the results of [8], extending it to incorporate attribute additions and the treatment of conditions. The treatment of attribute deletions in [8] is quite elaborate; we confine ourselves to a restricted version to avoid overcomplicating both the size of the requested metadata and the language extensions. Still, the tags of [8] for deletions can easily be taken into consideration in our framework. Note that all algorithms for rewriting views when the schema of their source data changes (e.g., [2, 5]) are orthogonal to our approach. Due to this generality, our approach can be extended in the presence of new results on such algorithms.

Model mappings. Recently, model management [3, 4] has provided a generic framework for managing model relationships, comprising three fundamental operators: match, diff, and merge. Our proposal assigns semantics to the match operator for the case of model evolution, where the source model of the mapping is the original database graph and the target model is the resulting database graph after evolution management has taken place. Velegrakis et al. have proposed a similar framework for the management of evolution. Still, the model of [14] is more restrictive, in the sense that it is intended towards retaining the original semantics of the queries. Our work is a larger framework that allows the restructuring of the database graph (i.e., the model) either towards keeping the original semantics or towards readjusting to the new semantics.

6. Conclusions and future work

In this paper we have discussed the problem of performing what-if analysis for changes that occur in the schema/structure of the data warehouse sources. We have modeled software modules, queries, reports and views as (sequences of) queries in SQL extended with functions. Queries and relations have uniformly been modeled as a graph that is annotated with policies for the management of evolution events. We have presented an algorithm that detects the parts of the graph that are affected by a given change and highlights the way they are tuned to respond to it. We have also presented a principled method for resolving conflicts. Finally, we have also assessed our proposal over cases extracted from real world scenarios. Future work can be directed towards many goals, with patterns of evolution sequences being the most prominent one.

References

1. J. Banerjee et al. Semantics and implementation of schema evolution in object-oriented databases. In SIGMOD, 1987.
2. Z. Bellahsene. Schema evolution in data warehouses. In Knowledge and Information Systems 4(2), 2002.
3. P. Bernstein, A. Levy, R. Pottinger. A Vision for Management of Complex Models. In SIGMOD Record 29(4), 2000.
4. P. Bernstein, E. Rahm. Data Warehouse Scenarios for Model Management. In ER, 2000.
5. A. Gupta, I.S. Mumick, J. Rao, K.A. Ross. Adapting materialized views after redefinitions: Techniques and a performance study. In Information Systems 26, 2001.
6. M. Golfarelli, J. Lechtenbörger, S. Rizzi, G. Vossen. Schema Versioning in Data Warehouses. In ECDM, 2004.
7. M. Mohania, D. Dong. Algorithms for adapting materialized views in data warehouses. In CODAS, 1996.
8. A. Nica, A.J. Lee, E.A. Rundensteiner. The CSV algorithm for view synchronization in evolvable large-scale information systems. In EDBT, 1998.
9. G. Papastefanatos, P. Vassiliadis, Y. Vassiliou. Adaptive Query Formulation to Handle Database Evolution. In CAiSE Forum, 2006.
10. G. Papastefanatos, K. Kyzirakos, P. Vassiliadis, Y. Vassiliou. Hecataeus: A Framework for Representing SQL Constructs as Graphs. In EMMSAD, 2005.
11. Y.G. Ra, E.A. Rundensteiner. A transparent object-oriented schema change approach using view evolution. In ICDE, 1995.
12. J.F. Roddick et al. Evolution and Change in Data Management - Issues and Directions. In SIGMOD Record 29(1), 2000.
13. J.F. Roddick. A survey of schema versioning issues for database systems. In Information and Software Technology 37(7), 1995.
14. Y. Velegrakis, R.J. Miller, L. Popa. Preserving mapping consistency under schema changes. In VLDB Journal 13(3), 2004.
15. R. Zicari. A framework for schema update in an object-oriented database system. In ICDE, 1991.
16. M. Blaschka, C. Sapia, G. Höfling. On Schema Evolution in Multidimensional Databases. In DaWaK, 1999.
17. C. Kaas, T.B. Pedersen, B. Rasmussen. Schema Evolution for Stars and Snowflakes. In ICEIS, 2004.
18. M. Bouzeghoub, Z. Kedad. A Logical Model for Data Warehouse Design and Evolution. In DaWaK, 2000.
19. M. Golfarelli, S. Rizzi, A. Proli. Designing what-if analysis: towards a methodology. In DOLAP, 2006.
20. A. Simitsis, P. Vassiliadis, M. Terrovitis, S. Skiadopoulos. Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates. In DaWaK, 2005.
21. C.T. Liu, P.K. Chrysanthis, S.K. Chang. Database schema evolution through the specification and maintenance of changes on entities and relationships. In ER, 1994.
22. D. Tsichritzis, A.C. Klug. The ANSI/X3/SPARC DBMS Framework: Report of the Study Group on Database Management Systems. In Information Systems 3(3), 1978.

