
Data Integration Services

Nesime Tatbul, Olga Karpenko, Christian Convey, Jue Yan

Technical Report

Brown University Computer Science Department

May 2001

Chapter 1

Data Integration Services

1 Introduction

With the prevalence of network technology and the Internet, accessing data independently of its physical storage location has become much easier. This has enabled users to access a multitude of data sources that are related in some way and to combine the returned data into useful information that is not physically stored in any single place. For instance, a person who intends to buy a car can query several car dealer web sites and then compare the results. He can further query a data source that provides car reviews to help him decide among the cars he liked. As another example, imagine a company with several branches in different cities. Each branch has its own local database recording its sales. Whenever global decisions about the company have to be made, each branch database must be queried and the results must be combined. On the other hand, contacting data sources individually and then combining the results manually every time some information is needed is a very tedious task. Instead, a service is needed that provides transparent access to a collection of related data sources as if these sources as a whole constituted a single data source. We call such a service a data integration service, and the system that integrates multiple sources to provide this service is usually referred to as a data integration system (Figure 1.1).

The main contribution of a data integration system is that users can focus on specifying what data they want rather than on describing how to obtain it. A data integration system relieves the user from the burden of finding the relevant data sources, interacting with each of them separately, and then combining the data they return. To achieve this, the system provides an integrated view of the data stored in the underlying data sources. Users can uniformly access all the data sources as if they were querying a single data source. Access to the integrated data is usually in the form of querying rather than updating the data.

Figure 1.1: Data Integration System

Furthermore, a data integration system facilitates decision support applications like OLAP (On-Line Analytical Processing) and data mining. OLAP performs financial, marketing, or business analysis over a collection of data from one or more data sources in order to support business decisions. The analysis is done by asking a large number of aggregate queries over the detailed data. For example, the company in our previous example can easily develop OLAP applications once its branch databases are integrated. Data mining is the discovery of knowledge from a large volume of data: statistical rules or patterns are automatically found in the raw collection of data. Data integration helps bring together a large body of data from multiple data sources so that it can be uniformly queried for knowledge discovery. Detailed information on data mining techniques can be found in the Customization Chapter.
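To give a flavor of such analysis, the following is a minimal sketch of the kind of aggregate query an analyst might run once the branch databases are integrated; the unified relation BranchSales(branch, saleDate, amount) and its attribute names are invented for illustration and do not appear elsewhere in this chapter.

-- Total revenue and number of sales per branch over the integrated data
SELECT branch, SUM(amount) AS totalRevenue, COUNT(*) AS numSales
FROM BranchSales
GROUP BY branch

OLAP workloads consist largely of many such grouping and aggregation queries, typically sliced along different dimensions such as branch, time period, or product.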

In this chapter, we discuss the issues involved in building and operating a data integration system and provide a survey of existing solutions to these issues.

1.1 Major Issues

Let us investigate the main stages involved in building and using a data integration system to comprehend the major issues: design, modeling, and operation.

• A data integration system is basically an information system. Like all computer systems, its architecture has to be designed, with the data sources to be integrated as its major components. Usually, the data sources must be integrated as they are, without any changes to their design or operation.

• Also, like all information systems, there is an application domain a data integration system has to model. This application domain is determined by the underlying data sources, and its modeling should be based on the models of the data sources that make up the integration system.

• After its modeling and design, a data integration system has to be provided with query functionality. This is again highly dependent on the underlying data sources' query capabilities.


Although the contents of the data sources are related in some way, they are likely to vary in many aspects. These differences make both the design and modeling phase and the operation phase of a data integration system very difficult. The major issue in building a data integration system is resolving these differences between the data sources, which may occur at different levels. This issue is generally referred to as heterogeneity of the data sources.

The data sources to be integrated may belong to the same enterprise (like the company example), but they might also be arbitrary sources on the World Wide Web (like the car buyer example). Most of the time, each of the sources is independently designed for autonomous operation. Also, the sources are not necessarily databases; they may be legacy systems, old and obsolete systems that are difficult to migrate to a modern technology, or structured/unstructured files with different interfaces. Data integration requires that the differences in modeling, semantics, and capabilities of the sources, along with possible data inconsistencies, be resolved. More specifically, the major issues that make integrating such data difficult include:

• Heterogeneity of the data sources
Each source to be integrated might model the world in its own way. The representation of data with similar semantics might be quite different in each data source. For example, each might be using different naming conventions to refer to the same real-world object. Moreover, they may contain conflicting data. In addition to data representation and modeling differences, heterogeneity may also occur at lower levels, including the access methods the sources are using, the operating systems underlying the individual data sources, etc.

• Autonomy of the data sources
Usually data sources are created in advance of the integrated system. In fact, most of the time they are not even aware that they are part of an integration. They can make decisions independently and they cannot be forced to act in certain ways. As a natural consequence of this, they can also change their data or functionality without any announcement to the outside world.

• Query correctness and performance
Queries to an integrated system are usually formulated according to the unified model of the system. These queries need to be translated into forms that can be understood and processed by the individual data sources. This mapping should not cause incorrectness in query results. Also, query performance needs to be controlled, as there are many factors which can degrade it. These include the existence of a network environment, which can cause communication delays, and the possible unavailability of the data sources for answering queries.


1.2 Chapter Outline

In the rest of this chapter, we discuss the above mentioned issues in more detail. The order of the subsections in the chapter roughly corresponds to the stages involved in building and operating a data integration system. We start out by presenting the common approaches to architecting a data integration system in Section 2. Later, we discuss the semantic problems encountered in the modeling and data mapping stages of a data integration system in Section 3. Techniques for querying the integrated data are presented in Section 4. The data extraction phase of querying, where data is actually obtained from the data sources, is detailed in Section 5. We devote Section 6 to the discussion of an important issue in one particular type of data integration architecture: management of materialized views in data warehousing systems. This section completes our discussion of the major problems and solutions. Finally, Section 7 concludes the chapter.


2 Data Integration Architectures

The data sources can be organized in the integration system in many ways. In this section we introduce three main architectures of data integration systems: federated databases, mediation, and data warehousing. We group these approaches based on whether the queries to the data sources are sent to the sources when these queries arrive, or the results of the queries are pre-stored. The former approach is a virtual approach and the latter is a materialized approach to data integration. We compare the approaches at the end of this section. We use three parameters to describe the characteristics of the sources of these integration systems: autonomy, heterogeneity, and distribution [Has00, OV99].

• Autonomy
Autonomy indicates how independent the data sources are from the other sources and from the integrated system. According to Veijalainen and Popescu-Zeletin's classification [MW88], there are three types of autonomy:

– Design autonomy
The source is independent in data models, naming of the data elements, semantic interpretation of the data, constraints, etc.

– Communication autonomy
The source is independent in deciding what information it provides to the other components that are part of the integrated system and to which requests it responds.

– Execution autonomy
The source is independent in the execution and scheduling of incoming requests.

• Heterogeneity
Heterogeneity refers to the degree of dissimilarity between the component data sources that make up the data integration system. It occurs at different levels. On a lower level, heterogeneity comes from different hardware platforms, operating systems, and networking protocols. On a higher level, heterogeneity comes from different programming and data models as well as different understanding and modeling of the same real-world concepts (e.g., naming of relations and attributes).

Logical heterogeneity cannot be resolved automatically, as it comes from the fact that different people represent the same concept differently. It involves both schematic and semantic heterogeneity. Schematic problems are differences in the elements that are used to represent some concept. For example, to store information about voluntary student positions in the University, one database developer may use an attribute named for each job (Tea Czar, Hospitality Czar) with true/false values for each student, while another developer may model these jobs as values of an attribute Job (a sketch of these two alternatives in SQL appears after this list). Some of the semantic problems that arise are the interpretation of names and differences in the units used for attributes. We discuss these issues in Section 3.

• Distribution
Distribution refers to the physical distribution of data over multiple sites. When creating an integrated system and choosing the appropriate architecture, the designers should take into account the possible latency of communicating with the data sources.
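The schematic heterogeneity mentioned under the Heterogeneity item can be made concrete with a small SQL sketch. The table and column names below are invented for illustration; they are not taken from any system discussed in this chapter.

-- Design 1: one column per job, with a true/false value for each student
CREATE TABLE StudentPositions1 (
    studentId       INTEGER,
    teaCzar         BOOLEAN,   -- true if the student is the Tea Czar
    hospitalityCzar BOOLEAN    -- true if the student is the Hospitality Czar
);

-- Design 2: the jobs appear as values of a single Job attribute
CREATE TABLE StudentPositions2 (
    studentId INTEGER,
    job       VARCHAR(40)      -- e.g. 'Tea Czar', 'Hospitality Czar'
);

Both designs can represent the same facts, yet a schema integrator must recognize that a column name in the first design corresponds to a data value in the second.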

We further consider the most difficult case: fully distributed and heterogeneous systems with autonomous or semi-autonomous data sources. Metadata - the auxiliary data describing the main data - is maintained in the integrated system to deal with the problems caused by the heterogeneity. It can contain both technical information about the sources (such as query capabilities and access methods) and semantic information (such as the semantic connections between the relations and the domain dictionary specification) [BKLW99].

We describe the main architectural approaches to the design of data integration systems, and discuss some solutions to the issues caused by autonomy, distribution, and heterogeneity.

2.1 Major Approaches to Data Integration

Two common approaches to integrating data sources are the following:

• Virtual View Approach
In this case the data is accessed from the sources on demand, when a user submits a query to the information system. This is also called a lazy approach to data integration.

• Materialized View/Warehousing Approach
Some filtered information from the data sources is pre-stored (materialized) in a repository (warehouse) and can be queried later by users. This method is also called an eager approach to data integration.

Sometimes a hybrid approach is used: integrated data is selectively materialized. The data is extracted from sources on demand, but the results of some queries are pre-computed and stored. In order to choose which queries to materialize, designers should consider many factors, such as the "popularity" of queries and the cost of maintenance [Ash00]. These issues are discussed in Section 6.
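As a purely illustrative sketch of the hybrid approach, the answer to a frequently asked query could be pre-computed and stored as a new relation that the integration system then treats as just another source. The relation names and the CREATE TABLE ... AS SELECT syntax below are assumptions for illustration (not every DBMS supports this form), and Cars stands for some virtually integrated relation over the underlying sources.

-- Materialize the answer to a popular query; the stored table can be
-- refreshed periodically and queried without contacting the sources.
CREATE TABLE PopularCars AS
SELECT vin, make, model, price
FROM Cars
WHERE price < 15000;

Choosing which queries deserve this treatment is exactly the trade-off between query popularity and maintenance cost mentioned above.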

2.2 Virtual View Approach

Here we discuss two architectures for integrating data sources using a virtual view approach. They are federated database systems and mediated systems.


2.2.1 Federated Database Systems

A Federated Database System (FDBS) consists of semi-autonomous components (database systems) that participate in a federation to partially share data with each other [SL90]. Each source in the federation can also operate independently from the others and from the federation.

The components cannot be called "fully autonomous" because each component is modified by adding an interface that allows communication with all the other databases in the federation.

Each of the component database systems can be either a centralized DBMS, a distributed DBMS, or another federated database management system, and may have any of the three types of autonomy mentioned above (design, communication, or execution autonomy). As a consequence of this autonomy, heterogeneity issues become the main problem.

There are loosely coupled FDBSs and tightly coupled FDBSs.

A tightly coupled FDBS has a unified schema¹ (or several unified schemas), which can be either semi-automatically built by schema integration techniques (see Section 3 for details) or created manually by the users. To resolve the logical heterogeneity, a domain expert needs to determine correspondences between the schemas of the sources. A tightly coupled FDBS is usually static and difficult to evolve, because schema integration techniques do not make it easy to add or remove components. An example of this kind of FDBS is Mermaid [TBC+87].

¹ A unified schema is the schema produced from the schemas of the integration system components, after resolving all syntactic and semantic conflicts between these schemas. This schema allows users to query the integrated system as if it were one database.

A loosely coupled FDBS does not have a unified schema, but it provides some unified language for querying the sources. In this configuration, component database systems have more autonomy, but humans must resolve all semantic heterogeneities. Requested data comes from the exporter of this data itself, and each component can decide how it will view all the accessible data in the federation. As there is no global schema, each source can create its own "federated schema" for its needs. Examples of such systems are MRSDM [Lit85], Omnibase [Rea89], and Calida [JPSL+88].

As pointed out by Heimbigner and McLeod [HM85], in order to remain autonomously functioning systems while at the same time providing mutually beneficial sharing of data, the components of an FDBS should have facilities to communicate in three ways:

• Data exchange
The components should be able to access the shared data of the other components of the FDBS. This is the most important purpose of the federation, and good mechanisms for data exchange are a must.

• Transaction sharing
There may be cases where, for some reason, a component does not want to provide direct access to some of its data, but can share operations on its data. Other components should have the ability to specify which transactions they want to be performed by another component.

• Cooperative activities
As there is no centralized control, cooperation is the key in a federation. Each source should be able to perform a complex query involving access to data from other components.

The most naive way to achieve interoperability² is to map each source's schema to all the others' schemas, a so-called pair-wise mapping. An example of such a federated database system is shown in Figure 1.2. Unfortunately, this requires n · (n − 1) schema translations (for example, five sources already require 20 translations) and becomes too tedious with a large number of components in a federation. Research is being done on tools for efficient schema translation (see Section 3 for details).

² Interoperability here means the ability of each source to use the data of the other sources.

We should note that the term "Federated Database Systems" is used differently in the literature: some researchers call only tightly coupled systems FDBSs [BKLW99], some call only loosely coupled systems FDBSs [HM85], and some take the same approach we did by considering tight and loose architectures to be two kinds of federated database system architecture [SL90].

Figure 1.2: Example of federated database architecture (pair-wise mappings among databases DB1 through DB5)

A federated architecture is very appropriate when there are a number of autonomous sources and we want, on one hand, to retain their "independence" by allowing users to query them separately and, on the other hand, to allow them to collaborate with each other to answer queries.

2.2.2 Mediated Systems

A mediated system integrates heterogeneous data sources (which can be databases, legacy systems, web sources, etc.) by providing a virtual view of all of this data. Users asking queries of the mediated system do not have to know about data source locations, schemas, or access methods, because such a system presents one global schema to the user (called the mediated schema) and users pose their queries in terms of it.

A mediation architecture is different from a tightly coupled federation in the following ways [SL90]:

• A mediated architecture may have non-database components

• The query capabilities of sources in a mediator-based system can be restricted, and the sources do not have to support SQL querying at all

• Access to the sources in a mediator-based system is usually read-only, as opposed to read-write access in an FDBS (due to the fact that the sources in the mediator-based system are more autonomous) [BKLW99]

• Sources in a mediator-based approach have complete autonomy, which means it is easy to add or remove data sources

Figure 1.3: Mediated architecture (borrowed with some minor changes from [GMUW00]): queries flow from the mediator, which consults metadata, through one wrapper per data source (Source 1 through Source n)

A typical architecture for a mediated system is shown in Figure 1.3. The main components of a mediated system are the mediator and one wrapper per data source. The mediator (sometimes also called an integrator) performs the following actions in the system:

1. Receives a query formulated on the unified (mediated) schema from a user.

2. Decomposes this query into sub-queries to individual sources based on source descriptions.


3. Optimizes the execution plan based on source descriptions.

4. Sends the sub-queries to the wrappers of the individual sources, which will transform these sub-queries into queries over the sources' local models and schemas. Then the mediator receives the answers to these sub-queries from the wrappers, combines them into one answer, and sends it to the user.

These steps are described in detail in Section 4.

A wrapper hides the technical and data model details of the data source from the mediator. It is an important component of both a mediator-based architecture and a data warehouse. Please refer to Section 5 for more information about wrappers.

Example
Let us assume there are two data sources: two car dealer databases that both became part of the Acme Cars company. Each of the car dealers has a separate schema for storing information about cars. Dealer 1 stores it in the relation:

Cars(vin, make, model, color, price)

Dealer 2 stores information about his cars for sale in the relation:

CarsForSale(vehicleID, carMake, carModel, carColor, carPrice).

Acme Cars uses a mediated architecture to integrate these two dealers' databases. It does this by providing a mediated schema over the two schemas above. The mediated schema consists of just one relation:

Automobiles(vin, autoMake, autoModel, autoColor, autoPrice).

Now if a client of Acme Cars submits the SQL query:

SELECT vin, autoModel, autoColor
FROM Automobiles
WHERE autoMake = 'Honda' AND autoPrice < 14000

The wrapper for the first database will translate this query to:

SELECT vin, model, color
FROM Cars
WHERE make = 'Honda' AND price < 14000

It also renames model to autoModel and color to autoColor. The wrapper for the second dealer will translate this query to:

SELECT vehicleID, carModel, carColor
FROM CarsForSale
WHERE carMake = 'Honda' AND carPrice < 14000

The wrapper also renames vehicleID to vin, carModel to autoModel, and carColor to autoColor.
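One way to make the correspondence between the mediated schema and the two source schemas explicit is with SQL view definitions, assuming for the moment that both sources are relational and SQL-accessible; the sketch below is only an illustration of the idea (a GAV-style mapping, see Section 4), not a description of how the systems listed below implement it.

-- Hypothetical view expressing the mediated relation Automobiles
-- in terms of the two dealers' relations.
CREATE VIEW Automobiles (vin, autoMake, autoModel, autoColor, autoPrice) AS
SELECT vin, make, model, color, price
FROM Cars                              -- Dealer 1
UNION ALL
SELECT vehicleID, carMake, carModel, carColor, carPrice
FROM CarsForSale;                      -- Dealer 2

A query over Automobiles can then be answered by pushing the selection and projection down into each branch of the union, which is essentially what the wrappers did above.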

Some known implementations of the mediator-based architecture are TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) [CGMH+94], Information Manifold [KLSS95], SIMS [AHK96], and Carnot [HSC+97].

2.3 Materialized View Approach (Data Warehousing)

In a materialized view approach, data from various sources is integrated by providing a unified view of this data, as in a virtual view approach, but here this filtered data is actually stored in a single repository (called a data warehouse). A data warehouse is different from traditional databases with OLTP (On-Line Transaction Processing) in the following ways [CD97]:

• It is mainly designed for decision support. As a consequence, a data warehouse often contains historical and summarized data. That also implies that the users of a data warehouse are different from the users of a traditional DBMS: they will be analysts, knowledge workers, and executives

• Workloads in warehouses are query intensive; queries are complex, and query throughput is more important than transaction throughput

• Information is usually read-only, as opposed to the read/write operations in OLTP.

There are three important steps involved in building and maintaining a data warehouse:

• Modeling and design

In the stage of designing a warehouse, the developers need to decide what information from each source they are going to use in the warehouse, what views (queries) over these sources they want to materialize, and what the global unified schema of the warehouse will be.

• Maintenance (refreshing)

Maintenance deals with how the warehouse is initially populated from the source data and how it is refreshed when the data in the sources is updated. View maintenance is a key research topic specific to data warehousing, and we discuss it in detail in Section 6.

• Operation

Operation of a data warehouse involves query processing, storage, and indexing issues.


Figure 1.4: Data warehouse architecture (an integrator, using metadata, populates the data warehouse through one wrapper per data source; user queries are answered from the warehouse)

An example of a data warehouse architecture is given in Figure 1.4.

Example
Suppose there is a company, Cute Toys, that owns two toy stores. There are two types of toys at each store: teddy bears and dogs. Each store has a database recording the number of toys of each kind sold on each date. Store 1 stores the relation Sales(date, typeToy, numberSold), and store 2 has two relations: TeddyBears(date, numberSold) and DogsToys(date, numberSold).

Now assume that the company would like to have the following relation in the data warehouse for decision making purposes (future marketing):

ToySales(date, typeToy, numberSold)

In this case, the integrator needs to first select the appropriate tuples from each source, take their union, and then aggregate, so that for each date and type of toy we have the total number of toys of that kind sold on that date. The SQL query to the first source is straightforward, as the relation is exactly the same apart from its name. It looks like the following:

INSERT INTO ToySales1(date, typeToy, numberSold)
SELECT date, typeToy, numberSold
FROM Sales


For the second source, the integrator can ask two queries:

INSERT INTO ToySales2(date, typeToy, numberSold)
SELECT date, 'TeddyBear', numberSold
FROM TeddyBears

INSERT INTO ToySales2(date, typeToy, numberSold)
SELECT date, 'Dog', numberSold
FROM DogsToys

So, the wrappers for sources 1 and 2 return the relations ToySales1 and ToySales2, respectively. The integrator component then combines them, summing the number of toys of each kind sold on each date:

INSERT INTO ToySales(date, typeToy, numberSold)
SELECT date, typeToy, SUM(numberSold)
FROM (SELECT date, typeToy, numberSold FROM ToySales1
      UNION ALL
      SELECT date, typeToy, numberSold FROM ToySales2) AS s
GROUP BY date, typeToy

Some implementations of the data warehousing approach to data integration include the Squirrel [HZ96] and WHIPS (WareHouse Information Prototype at Stanford) [HGMW+95] systems.

We would like to note that the sources that are integrated always retain their execution autonomy.

2.4 Comparison of the Architectures

The virtual view approach is preferable to data warehousing in the following cases:

• the number of data sources in an integrated system is very large and/or the sources are likely to be updated frequently (as in the case of web sources),

• there is no way to predict what kind of queries the users will ask.

If, however, the sources are permanent, are not updated too often, and the designers of the integrated system know what kinds of queries are to be expected most often, then the answers to these queries can be materialized. Also, if some sources are physically located far away from the mediator, then accessing them each time a query is formulated may introduce undesired delays in response time. In this case, a data warehousing approach might be chosen to improve performance.

Of the two architectures based on the virtual view approach (federation and mediation), the mediated approach is chosen more often. Federated systems are not very common nowadays, due to the large number of interfaces that need to be written for each source to communicate with all the others.


A hybrid approach is usually discussed as a way to improve the performance of some mediator-based systems. The approach to data integration in this case is virtual, but some selected queries are materialized in a repository. This repository can then serve as a new source for the mediated system. A hybrid approach is proposed in [Ash00], but otherwise it is less commonly discussed in the literature than data warehousing and mediation.


3 Schema Integration

A schema is a description of how data in a database appears to be structured to the users of the database. For example, in a relational database, the schema specifies what relations are in the database, what attributes are defined for each relation, etc. In an object-oriented database, the schema specifies what classes are defined, what attributes and methods those classes have, etc.

Schema integration is the work, performed while constructing an integrated information system, of reconciling the schemas of the different data sources into a single, coherent schema [JLYV00].

The product of schema integration is a (perhaps new) schema that can contain all of the information that is to be available from the integrated information system. Various metrics exist for judging how good the integrated schema is; they are discussed in Section 3.3.

Schema integration can be a very easy or a very difficult task, depending on how many data sources are to be integrated and on how differently their schemas represent information. This section explores the issues that can make schema integration so problematic, and describes the techniques that have been developed to deal with those problems.

3.1 Problems in Schema Integration

Schema integration problems can be broadly separated into two categories: the informal problems arising from how humans organize themselves, and problems in the formal realm of how schemas are represented.

3.1.1 Human Organizational Problems

Autonomous Data Sources   When performing data integration, it is possible that the people controlling the various data sources act fairly autonomously with respect to the people constructing the integrated system. Autonomous data sources seem even more common now than before the Internet became so popular, because the range of data sources available for integration is much larger than before.

When a data source is managed by people who are autonomous from the people constructing the integrated system, various problems can arise for the schema integration task:

• Lack of Schema Information Sharing

The source data administrators might not be interested in helping, or may not have the resources to help, the integrators understand how their site's schema relates to the schemas of the other sites being integrated.

• Unannounced Schema Changes

The source data administrators might change their site's schema without forewarning the integrators, leading the integration software to make invalid assumptions about the data source.


• Inconsiderate Schema Design

The data source administrators might choose a schema that is very difficult to integrate with the other schemas in the integrated system. In tightly controlled organizations, the various data source administrators might be coerced into all having easily integrated schemas. Such coercion is unlikely to be possible in highly autonomous environments.

Complexity of the Set of Data Source Schemas   Schema integration is a knowledge-intensive task. It is conceivable that for some large systems, no one human would ever be able to understand the schemas of all the constituent data sources [Hal95]. This places a limitation on the human-oriented methodologies that can be used to successfully integrate such systems [ND95].

3.1.2 Logical Problems

These problems fit squarely in the realm of logics, formal languages, semantics, etc. They are the focus of much attention in schema integration research, and their formal nature lends them to attempted solutions involving logic, semantics, and knowledge representation.

Numerous incompatible taxonomies have been proposed for describing the problems that can occur in schema integration. Several representative taxonomies appear below.

The Taxonomy from [JLYV00]³

³ It is claimed in [JLYV00] that consensus has been reached on using this taxonomy rather than competing taxonomies.

• Heterogeneity Conflicts

Problems with the use of different data models in different schemas. For example, one schema may use an object-oriented database, while the integrated schema must be represented with a relational database.

• Naming Conflicts

Different schemas may use the same term to describe different concepts (homonyms) or two different terms to describe the same concept (synonyms).

• Semantic Conflicts

When different schemas use different levels of abstraction to model the same entity.

For example, one database might distinguish between "cars" and "trucks", whereas another schema in the same integrated system might simply model "automobiles" and fail to store the car/truck distinction.


• Structural Conflicts

Different schemas may represent the same information in different ways.

For example, one car ownership schema may use a single table that stores car and owner information, while another schema may normalize the same information into a "car" table and an "owner" table, as sketched below.
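The following SQL sketch of the car-ownership example is purely illustrative; all table and column names are invented and do not come from the systems cited in this chapter.

-- Schema A: one table stores both car and owner information
CREATE TABLE CarOwnership (
    vin       CHAR(17),
    model     VARCHAR(40),
    ownerName VARCHAR(80),
    ownerCity VARCHAR(40)
);

-- Schema B: the same information normalized into two tables
CREATE TABLE Car (
    vin     CHAR(17),
    model   VARCHAR(40),
    ownerId INTEGER          -- references Owner.ownerId
);
CREATE TABLE Owner (
    ownerId INTEGER,
    name    VARCHAR(80),
    city    VARCHAR(40)
);

Both schemas can hold the same facts, but an integrator must map the single CarOwnership table to the Car/Owner pair (or vice versa) when building the unified schema.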

The Taxonomy from [Var99]   This taxonomy is largely a refinement of [JLYV00]'s Heterogeneity Conflicts concept, but it is still slightly incompatible with the other taxonomy. [Var99] offers this as a taxonomy of semantic inconsistencies (i.e., semantic conflicts). However, this taxonomy includes Naming Conflicts as a cause of semantic inconsistency, while [JLYV00] considers naming conflicts to be very distinct from semantic inconsistencies.

• Naming Conflicts

This is the same notion as Naming Conflicts from [JLYV00].

• Domain Conflicts

Different schemas use different simple values to represent data.

For example, one schema might store the car's price as an integer, while another might store a textual rendition of the car's price in a text string.

• Metadata Conflicts

A concept can be represented within the schema in one data source, but as regular (non-schema) data in another data source.

For example, one data source may distinguish between cars and trucks by maintaining two separate tables, one for cars and one for trucks; which table a record appears in specifies whether the vehicle is a car or a truck. Another data source may use a single table, but have a field in that table that indicates whether a row in the table represents a car or a truck. (A SQL sketch of this example appears after this list.)

• Structural Conflicts

This is the same notion as Structural Conflicts from [JLYV00].

• Missing Attributes

One schema may represent a superset of the information available in another schema.

For example, in two schemas that represent cars for sale, one schema may include an attribute for the date of the car's last oil change, whereas the other schema makes no provision for storing that information.

This issue is related to [JLYV00]'s Semantic Conflicts in the sense that both deal with differences in the level of detail that two schemas can store about the same entity.


• Different Hardware/Software

This conflict describes the fact that two information systems that are being integrated can have different hardware, operating systems, communications protocols, etc. Those differences can cause problems when integrating the two systems.

In our opinion, this is not a cause of semantic inconsistency when integrating the information systems. It is a more concrete, low-level issue that has little to do with the semantics of the information systems.
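Returning to the Metadata Conflicts item above, the cars/trucks example can be sketched in SQL as follows; the table and column names are invented for illustration only.

-- Source A: the car/truck distinction is encoded in the schema (two tables)
CREATE TABLE CarsA (
    vin   CHAR(17),
    model VARCHAR(40)
);
CREATE TABLE TrucksA (
    vin   CHAR(17),
    model VARCHAR(40)
);

-- Source B: the same distinction is ordinary data (a column value)
CREATE TABLE VehiclesB (
    vin         CHAR(17),
    model       VARCHAR(40),
    vehicleType VARCHAR(10)   -- 'car' or 'truck'
);

An integrator must recognize that the table names CarsA and TrucksA in source A play the same role as the values of the vehicleType column in source B.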

The Taxonomy from [ND95]   This work does not offer a full taxonomy of schema integration problems, but it does discuss one problem omitted from the two taxonomies listed above: recognition of object identity across different data sources/schemas.

Different data sources may attempt to provide information about the same entity. Recognizing the instances where two or more data sources are in fact both describing the same entity can be problematic.

3.2 Representation of the Integrated Schema

The integrated schema will generally be represented in one of the following forms [LSS93].

3.2.1 Common Data Model

This is the design decision to choose a particular data model (such as relational or object-oriented) in which to provide access to the data in the integrated system.

Common Data Model (CDM) vs. Homogeneous Descriptions   A homogeneous description in an integrated system is that system's single, unified schema [JLYV00].

The design choice of whether or not to use a CDM must not be confused with the choice of whether or not to use a homogeneous description for the integrated system.

The concepts are distinct. A CDM only specifies that some particular (perhaps unspecified) data model (e.g., object-oriented or relational) will be used to represent the integrated system. In contrast, a homogeneous description specifies not only the data model to be used, but also the particular schema to be provided by the integrated system.

CDM and homogeneous descriptions are similar, however, because higher-order logics are an alternative to each choice, as we will see later.

Integration Practices Associated with CDM   The use of a CDM has traditionally been paired with the development of a homogeneous description for the integrated system in a one-time effort [JLYV00].


The implementation of integrated systems using a CDM also has some association with the use of procedural languages, rather than declarative languages [CGL+98].

3.2.2 Description Logics

Description Logics (DLs) are languages used to represent knowledge in a particular structured manner. A DL model uses the notions of concepts and roles to represent basic ideas about the world [CLN99].

Concepts are unary predicates that specify a subset of some domain. For example, a concept might be the notion of "car", "truck", "automobile", or "automobile dealership". Each of those concepts is a definition that includes some objects but excludes others.

Roles are binary predicates that can be used to express relationships between concepts. For example, a role might be "for-sale-by", which represents the binary relationship that can exist between a "car" and an "automobile dealership". [CGL+98] describes a DL that also explicitly models n-ary predicates.
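As a small, hypothetical illustration of the notation (the assertions below are invented around this chapter's car/truck/dealership vocabulary and are not taken from [CLN99] or [CGL+98]), a DL model might contain:

Car ⊑ Automobile                          (every car is an automobile)
Truck ⊑ Automobile                        (every truck is an automobile)
Car ⊓ Truck ⊑ ⊥                           (nothing is both a car and a truck)
CarForSale ⊑ Car ⊓ ∃for-sale-by.AutomobileDealership
                                          (a car for sale is a car offered by some dealership)

Statements of this form are what a DL reasoner manipulates when it checks properties such as subsumption and disjointness, discussed below.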

Description Logics in Schema Integration   DLs can be used by software to reason about the semantics of data when provided with basic semantic information [Bor95]. This makes them a powerful tool in the computer-assisted design of integrated schemas, because DL-based reasoning can make the humans designing the integrated system aware of certain relationships within and between schemas that may otherwise have gone unnoticed.

The use of DLs for data integration advocated in [CGL+98] employs a DL not only to model each data source, but also to express a model of a global domain. The global domain contains the set of concepts and roles that are used in the integrated view of the system.

DL reasoning systems use a set of intermodel assertions [CGL+98] that humans can state. These assertions, expressed in terms of the already-defined concepts and roles, express relationships between the concepts and roles of the data sources in the integrated system, and between the data sources and the global domain model of the integrated system.

DL-based systems can do a great deal of automatic reasoning as data sources are added to or removed from the integrated system. This automatic reasoning can reduce the effort invested, and the errors introduced, by the humans designing the integrated system.

Schema characteristics that DLs can identify include [Bor95]:

• Coherency of a Concept

Whether or not any element in a database could ever meet the requirements for inclusion in the concept.

• Subsumption of One Concept by Another

Identifies which concepts will always have a superset/subset relationship.


• Mutual Disjointness of Two Concepts

Identifies whether or not the same object could ever meet the requirements for membership in both concepts.

• Equivalence of Two Concepts

Identifies whether or not two concepts will always contain exactly the same set of elements.

Ability to Represent Schemas from Various Data Models   One reason that DLs are a useful tool in reasoning about schemas is that DLs meeting certain criteria are capable of representing the schemas of many popular data models, such as the entity-relationship and object-oriented (sans the methods) models [Bor95].

3.2.3 Other Formalisms for Schema Integration

Description Logics are not the only languages that can be used to aid in schema integration. See [HG92] and [ND95] for examples of such formalisms.

3.3 Quality Metrics for Integrated Schemas

Various quality metrics for integrated schemas have been proposed:

• Accessibility - All data needed from the data sources to provide the integrated view is in fact available from the present set of data sources [CGL+98].

• Believability - Warranting confidence that the data provided by the integrated system and/or data sources is consistent (in the Description Logic sense) and complete [CGL+98].

• Completeness [JLYV00]

• Consistency (in the Description Logic sense) of each data source [CGL+98]

• Correctness [JLYV00]

• Minimality [JLYV00]

• Understandability [JLYV00]

• Integration Transparency - In systems that use a Common Data Model, this is the ability of the integrated system to provide views of itself that actually look like one of its constituent data sources [LSS93].⁴

• Information Capacity - The ability of an integrated schema to express all of the information that the data source schemas can express [EJ95].

• Readability - The integrated schema makes clear to humans the important relationships that are implied by the integrated schema [CGL+98].

• Redundancy - The recognition of equivalent concepts [CGL+98].

⁴ One might consider Integration Transparency to be a feature that is present or absent from an integrated system, rather than a metric that can be given various scores.

3.4 Steps in Schema Integration

Some attention has been paid to the steps that humans, and their software tools, go through in the design of an integrated schema.

3.4.1 The Overall Schema Integration Process

No general consensus on what the steps are emerges from a survey of the academic literature on the subject. Below are two different breakdowns that have been proposed.

From [BF94], we have:

1. Pre-integration

This step involves:

• translating the data source schemas into the integrated system's common data model, and

• semantic enrichment [JLYV00] of the source schemas: recording additional semantic information about the schema in a semantic data model (such as an entity-relationship model)

This is done for two reasons:

• Using one semantic data model for all data sources eliminates issues that arise from the data sources using different data models for their schemas.

For example, suppose two car dealerships are integrating their customer databases. One dealership's database uses a relational schema, and the other uses an object-oriented schema. When integrating the systems, both of those schemas can first be translated into a semantic data model, such as entity-relationship, to simplify reasoning about the integration.

• The semantic data model can express the relationships between the data source's schema elements and the problem domain that could not be expressed by the data source schema's data model. Having a formal representation of the additional semantic information is helpful, and perhaps necessary, for producing a good integrated schema. Note that this additional information must be discovered by humans, since it may be simply absent from some data source schemas.


2. Comparison

This is the analysis of the collection of data sources being integrated, looking for relationships between the elements of the various schemas.

This can be done at two levels: comparison of the schemas, and comparison of the actual data in the data sources. Statistical reasoning techniques, such as fuzzy logic, might be used in these steps to guess at the relationships.

3. Integration

This is the construction of the integrated schema.

4. Schema Transformation

In contrast to [BF94], [JLYV00] offers the following sequence:

1. Pre-integration

This includes an early planning phase for the integration project, including selection of the schemas to be integrated and the order in which they will be integrated.

As with [BF94]'s pre-integration step, this step also includes semantic enrichment of the source schemas.

2. Schema Comparison

This is the analysis of the collection of source schemas to look for correlations and conflicts between them.

A partial list of conflicts that might be detected at this stage appears in Section 3.1.

3. Schema Conforming

This is the modification of the source schemas to make them more suited for integration with each other.

This includes the resolution of the conflicts that were detected in the schema comparison step, which remains a partially manual step for humans.

[JLYV00] suggests that there are reasons besides conflict resolution that might lead to the modification of source schemas, but does not elaborate on what those reasons are.

4. Schema Merging and Restructuring

This step is where the (conformed) source schemas are finally tied together to form the integrated schema.

The resulting integrated schema can then be evaluated in terms of the quality metrics described in Section 3.3. The results of that quality analysis can lead to further iteration of the schema integration to improve the quality of the integrated schema.


3.4.2 Processes for Performing Incremental Integration Steps when Using Higher-Order Logics

[CGL+98] describes the steps that can be taken when new data sources or new types of queries are introduced to an integrated system that is integrated using a higher-order logic (i.e., a description logic).

Source-Driven   This is when a new data source is to be added to the integrated system. The steps to be taken are as follows.

1. Source Model construction

The information in the new data source is expressed in terms of the higher-order logic used by the integrated system.

2. Source Model integration

New intermodel assertions are recorded that relate the new data source to the other data sources and to the global domain model.

Conflicts that are made apparent after these assertions are recorded are also dealt with at this step.

3. Quality Analysis

This is the assessment of the quality of the integrated schema. The outcome of this assessment may lead to the repetition of some earlier steps in this sequence, or even to a reconsideration of the global domain model.

4. Source Schema specification

Recall that description logics may be used only at design time, to support the software tools that help humans design the integrated schema and develop query plans.

At runtime, the description logics may go unused, and a traditional schema (e.g., relational) must be used to access the data source.

This step is the construction of a new view of the data source that:

• is in a schema language usable by the system at runtime, and

• offers a view of the data source that was designed during the earlier source model integration step.

5. Materialized View Schema restructuring⁵

The new data source may have introduced new kinds of information to the integrated system. When the integrated system uses materialized views, those views may need to be restructured to be able to express the newly available information.

⁵ Only applicable when the integrated system uses materialized views (see Section 6).


Client-Driven Integration   This is when a query must be supported by the integrated system, but no execution plan has yet been formulated for that particular query.

To accommodate this event, humans can use software tools that reason about the integrated system's DL to determine whether or not the query can be answered using data source views that are already established.

See [CGL+98] for more specific details on how the reasoning software can help when the integrated system uses materialized views.

3.5 Schema Integration Tools

3.5.1 Available Tools

Based on a survey of academic literature and on the author's familiarity with industrial solutions for data warehousing, the set of tools for assisting with schema integration appears to be largely academic.

An excellent overview of key academic systems for schema integration can be found in [JLYV00].

3.5.2 Benefits of Using Schema Integration Tools

Schema integration tools are good for performing a great deal of reasoning about an integrated system, as long as humans have provided the information that these systems need in an appropriate language.

In particular, the tools can reduce the human effort needed to integrate schemas by:

• identifying and resolving some schema conflicts [Hal95]

• identifying relationships between the data that are stored in different sources that have different schemas [Hal95]

• optimizing the integrated schema in terms of consistency, redundancy, and type checking [Bor95]

• helping humans know how to rewrite newly-encountered queries [CGL+98]

• determining whether or not existing data sources are capable of answering a query [CGL+98]

3.6 The State of the Art

Schema integration is still an activity that involves humans, primarily at two steps:

• Schema enrichment of data sources

This activity may involve research by people to add information about source schemas that was never recorded in the schema, or perhaps even in written documents.


• Conflict resolution

When schema integration tools detect certain conflicts in how data sources and/or the global domain model express information, human judgement is currently needed to decide what to do about the problem.

A trend in research appears to be efforts to reduce the need for human involvement in the process. For the time being, however, schema integration can be labor intensive.


4 Querying the Integrated Data

The main purpose of building data integration systems is to facilitate access to a multitude of data sources. The ability to correctly and efficiently process queries over the integrated data lies at the heart of the system. The traditional way of query processing involves the following basic steps:

1. getting a declarative query from the user and parsing it

2. passing it through a query optimizer, which produces an efficient query execution plan describing exactly how to evaluate the query, i.e., which operators to apply, in what order, and using what algorithms

3. executing the plan on the data physically stored on disk

The procedure described above also applies, in general terms, to query processing in data integration systems. However, the task is more challenging due to the complexities brought by the existence of multiple sources with differing characteristics. First of all, we need to decide which sources are relevant to the query and hence should participate in query evaluation. These chosen data sources will participate in the process through their own query processing mechanisms. Second, due to the potential heterogeneity of the sources, there may exist various access methods and query interfaces to the sources. In addition to being heterogeneous, the sources are usually autonomous as well, and therefore not all of them may provide full query capability. Third, the sources might contain inter-related data. There may be both overlapping and inconsistent data. Overlapping data may lead to information redundancy and hence unnecessary computations during query evaluation. Especially in the case where there is a large number of sources and the probability of overlap is high, we may need to choose the most beneficial sources for query evaluation. Last but not least, the sources may be incomplete in terms of their content. Therefore, it may be impossible to present a complete answer to the user's query. This list of complications could easily be extended.

As discussed in Section 2, a data integration system may be built in two major ways: by defining a mediated schema over the participating data sources without actually storing any data at the integration system (virtual view approach), or by materializing the data defined by a unified schema at the integration system (materialized view approach). In both approaches, the user query is formulated in terms of the schema of the integrated system. However, in the latter approach, since the data is stored at the integration system according to the unified schema, query evaluation is no more difficult than the traditional way of query processing. The major issue there is the synchronization of the materialized data with the changes to the original data at the data sources, i.e., maintenance of the materialized views. We discuss this issue in Section 6. During maintenance, views defined over the data sources have to be processed on the data sources to re-materialize the updated data. In other words, query processing on the original data sources usually happens at a different time than the processing of the user's query on the materialized views. On the other hand, in the virtual view approach, data source access is required every time a user asks a query. Therefore, query processing for the virtual approach includes the issues that would arise in the maintenance stage of the materialized view approach. In this regard, we mainly discuss the query processing problem for the virtual view approach in this section.

In the rest of this section, we first briefly discuss the modeling issues, which form the basis of all the following arguments. Then we present the main stages of query processing in data integration systems in order, namely query reformulation, query optimization, and query execution.

4.1 Data Modeling and Mapping

Traditionally, to build a database system, we first model the requirements of the application and design a schema to support the application. In a data integration system, rather than starting from scratch, we have a set of pre-existing data sources which form the basis of the application. However, each of these data sources may have a different data model and schema. In other words, each source presents a partial view of the application in its own way of modeling. In fact, if we were to design a database system for the application starting from scratch, we would have another model, which would have the complete and ideal view of the world. To simulate this ideal, we need to design a unifying schema in a single data model based on the schemas of the data sources being integrated. Then each source needs to be mapped to the relevant parts of this unified schema. This single schema of the integrated system is called the "mediated schema". Having a mediated schema facilitates the formulation of queries to the integrated system. The users simply pose queries in terms of the mediated schema, rather than directly in terms of the source schemas. Although this is very practical and effective in terms of the transparency of the system to the user, it brings the problem of mapping a query over the mediated schema to one or more queries over the schemas of the data sources.

Figure 1.5 shows the main stages of query processing in data integration systems. There is a global data model that represents the data integration system, and each of the data sources has its own local data model. There are two conceptual translation steps: (i) from the mediated schema to exported source schemas, and (ii) from exported source schemas to source schemas. The difference comes from the data models used. In the former step, the user query is reformulated as queries towards individual sources, but these are still in the global data model. In the latter step, the source queries are translated into a form that is understandable and processable by the data sources directly, i.e., data model translation is achieved in this latter step. These two steps are performed by the mediator and the wrapper components in the system, respectively. In this section, we focus on the operation of the mediator; the details of the wrapper will be presented in Section 5.

As Figure 1.5 indicates, in addition to modeling the mediated schema, we need to model the sources so that we can establish an association between the relations in the mediated schema and the relations in the source schemas. This is achieved through source descriptions. The description of a source should specify its contents and the constraints on its contents. Moreover, we need to know the query processing capabilities of the data sources, because in general, information sources may permit only a subset of all possible queries over their schemas. Source capability descriptions include which inputs can be given to the source, the minimum and maximum numbers of inputs allowed, the possible outputs of the source, the selections the source can apply, and the acceptable variable bindings [LRO96].

Figure 1.5: Stages of Query Processing [Lev99b] (a query over the mediated schema is reformulated, using the source descriptions, into a logical plan of source queries in exported source schemas; query optimization turns this into a physical, distributed execution plan using source statistics; the execution engine runs the plan against the data sources through their wrappers, crossing from the global data model to the local data models)

In Figure 1.5, first, using the mediated schema and the source descriptions, the user query is reformulated into source queries in exported source schemas. An exported source schema is a source schema translated into the global data model. These source queries provide a logical plan to the query optimizer, which then produces a physical query execution plan using some source statistics. Afterwards, the physical plan is executed by the query execution engine by communicating with the data sources through their wrappers. Although it is not shown in this figure, the query execution engine later collects the results from the sources, which are then combined for presentation to the user.

To be able to present the methods for querying the integrated data, we need to choose a data model and a language in which to express the mediated schema, the source descriptions, and the queries. Due to its simplicity for illustrating the concepts, we will use the relational model as our global data model and Datalog as our language.

4.1.1 Datalog

We can express queries and views as Datalog programs. A Datalog program consists of a set of rules, each having the form:

q(X) :- r1(X1), . . . , rn(Xn)

where q and r1, . . . , rn are predicate names and X, X1, . . . , Xn are either variables or constants. The atom q(X) is called the head of the rule, and the atoms r1(X1), . . . , rn(Xn) are called the subgoals in the body of the rule. It is assumed that each variable appearing in the head also appears somewhere in the body. This guarantees that the rules are safe, meaning that when we use a rule we are not left with undefined variables in the head. The variables in X are universally quantified and all other variables are existentially quantified. Queries may also contain subgoals whose predicates are arithmetic comparisons. A variable that appears in such a comparison predicate must also appear in an ordinary subgoal so that it has a binding.

Predicates that represent relations stored in the database are called EDB (Extensional DataBase) predicates, and predicates whose relations are constructed by the rules are called IDB (Intensional DataBase) predicates. In the above rule, q is an IDB predicate. If all the ri are EDB predicates, then we have a conjunctive query. A conjunctive query has the following semantics: we apply the rule for the query to the EDB relations by substituting values for the variables in the body of the rule. If a substitution makes all the subgoals true, then the same substitution applied to the head is an inferred fact about the head predicate and an answer to the query [Ull97]. In this section, we will be considering conjunctive queries.
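To make these semantics concrete, the following is a minimal Python sketch of evaluating a single conjunctive rule over EDB relations by trying all substitutions; the relations and the tiny data set are hypothetical and chosen only for illustration.

from itertools import product

def eval_rule(head_vars, body, edb):
    """Evaluate one conjunctive rule: try every combination of tuples for the
    subgoals and keep the substitutions that bind each variable consistently."""
    answers = set()
    relations = [edb[pred] for pred, _ in body]
    for combo in product(*relations):
        binding, ok = {}, True
        for (pred, args), tup in zip(body, combo):
            for arg, val in zip(args, tup):
                if isinstance(arg, str) and arg.isupper():      # a variable
                    if binding.setdefault(arg, val) != val:
                        ok = False
                        break
                elif arg != val:                                 # a constant must match
                    ok = False
                    break
            if not ok:
                break
        if ok:
            answers.add(tuple(binding[v] for v in head_vars))
    return answers

# Hypothetical EDB relations: CAR(vin, status) and MODEL(vin, model, year).
edb = {"CAR": {(1, "used"), (2, "new")},
       "MODEL": {(1, "Honda", 1995), (2, "Toyota", 2001)}}
# Q(V, M) :- CAR(V, "used"), MODEL(V, M, Y)
q = [("CAR", ("V", "used")), ("MODEL", ("V", "M", "Y"))]
print(eval_rule(("V", "M"), q, edb))   # {(1, 'Honda')}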

4.1.2 Modeling the Data Sources

To reformulate a query over the mediated schema as a set of queries written in terms of the source schemas, we need to know the relationship between the relations in the mediated schema and the source relations. This is achieved by modeling the sources using source descriptions.

There are three approaches to describing the sources [Fri99]:

Global As View (GAV) Approach
For each relation R in the mediated schema, a view in terms of the source relations is written which specifies how to obtain R's tuples from the sources.


Example
The following simple example shows how the mediated schema relations CAR and REVIEW can be obtained from the source relations S1, S2 and S3.

S1(vin, status, model, year) ⇒ CAR(vin, status)

S2(vin, status, make, price) ⇒ CAR(vin, status)

S1(vin, status, model, year) ∧ S3(vin, review) ⇒ REVIEW(vin, review)

S2(vin, status, make, price) ∧ S3(vin, review) ⇒ REVIEW(vin, review)

This approach was taken in the TSIMMIS System [CGMH+94].

Local As View (LAV) Approach
For each data source S, a view in terms of the mediated schema relations is written that describes which tuples of the mediated schema relations are found in S.

Example
In LAV, we take the opposite approach to GAV and describe each source in terms of the mediated schema relations. Assume that source S1 contains cars produced after 1990 and source S2 contains cars sold by the dealer "ACME".

S1(vin, status, model, year) :- CAR(vin, status), MODEL(vin, model, year), year ≥ 1990

S2(vin, status, make, price) :- CAR(vin, status), MODEL(vin, make, year), SELLS(dealer_name, vin, price), dealer_name = "ACME"

S3(vin, review) :- REVIEW(vin, review)

Query processing using the LAV approach is an application of a much broader problem called "Answering Queries using Views". We discuss this problem further in the next section.

One of the systems that used this approach was the Information Manifold System [KLSS95].

Description Logics (DL) Approach
Description Logics are languages designed for building schemas based on hierarchies of collections. In this approach, a domain model of the application domain is created. This model describes the classes of information in the domain and the relationships among them. All available information sources are defined in terms of this model, by relating the concepts defining the information sources to the appropriate concepts defining the integrated system. Queries to the integrated system are also posed in terms of this domain model. In other words, the model provides a language, or terminology, for accessing the sources.

The DL approach is similar to LAV in that a view describing each source is written, except that the views are formulated not in terms of a mediated schema but in terms of concepts and classes from the application domain model. Queries are formulated in the same way.

This approach was taken in the SIMS System [AHK96].

Each of these approaches has certain advantages and disadvantages over the others [Lev99b]. The main advantage of GAV is that query reformulation is very easy: since the relations in the mediated schema are defined in terms of the source relations, it is enough to unfold the definitions of the mediated schema relations. Another advantage is the reusability of views as if they were sources themselves, which allows hierarchies of mediators to be constructed, as in the TSIMMIS System [CGMH+94]. However, it is difficult to add a new source to the system: we have to consider the relationship between the new source, all the other sources and the mediated schema, and then change the GAV rules accordingly. Query reformulation in LAV is more complex.6 However, LAV has important advantages compared to GAV: adding new sources and specifying constraints are easier. To add a new source, all we need to do is describe that source in terms of the mediated schema through one or more views; we do not need to consider the other sources. Moreover, if we want to specify constraints on the sources, we simply add predicates to the source view definitions.
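The ease of GAV unfolding can be seen in the following minimal Python sketch, which evaluates the GAV rules for CAR given above over two toy source tables; the data values are hypothetical, and a real mediator would unfold the rules inside the user query rather than compute whole mediated relations.

# S1(vin, status, model, year) and S2(vin, status, make, price) -- hypothetical toy data.
S1 = [(1, "used", "Honda", 1995), (2, "new", "Toyota", 2001)]
S2 = [(2, "new", "Toyota", 21000), (3, "used", "Subaru", 9000)]

def car():
    """Unfold the two GAV rules for CAR: project (vin, status) from each
    source relation and take the union (duplicates removed)."""
    return {(vin, status) for (vin, status, _, _) in S1} | \
           {(vin, status) for (vin, status, _, _) in S2}

# A query over the mediated schema, e.g. "used cars", is answered by
# unfolding CAR into the source projections above and then filtering:
used_cars = {t for t in car() if t[1] == "used"}
print(sorted(used_cars))   # [(1, 'used'), (3, 'used')]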

Compared to the GAV and LAV approaches, the DL approach has the benefit of presenting the user with a richer domain model with hierarchical structures. Since the source relations and the mediated schema relations are parts of the same domain model, the mapping between them is facilitated. However, DL by itself is not expressive enough to model arbitrary joins of relations [Lev99b]. As in the LAV approach, adding new data sources is easy in the DL approach; however, if the contents of the new source cannot be completely mapped to the domain model, then the domain model has to be extended [AKS96].

4.1.3 Using Probabilistic Information

The source descriptions mentioned so far consider the sources in isolation, although the sources may be related. Moreover, they carry the underlying assumption that the sources are complete. For example, we earlier assumed that source S1 contains cars produced after 1990. All the cars in S1 are certainly produced after 1990, but we do not know whether all cars produced after 1990 can be found there. Therefore, in addition to the qualitative source descriptions discussed in the previous subsection, we also need quantitative descriptions of the correlation and incompleteness of the sources [FKL97]. Qualitative descriptions allow us to rule out irrelevant sources; quantitative descriptions help us distinguish, among the relevant sources, the ones that have a higher probability of containing the answers.

6 As we shall see in the next section, the most important work on query reformulation focuses on the LAV approach.

[FKL97] categorizes the quantitative information needed into three kinds and shows how each can be specified using probabilities:

• coverage (completeness) of the sources
This specifies the degree to which sources cover what their qualitative descriptions suggest. It is expressed as the probability of finding certain data items in the source. For instance, if S1 is believed to cover 90% of all the cars produced after 1990, then this probability is 0.9.

• overlap between parts of the mediated schema
This specifies the degree of overlap between parts of the mediated schema, and hence indirectly the overlap between the data sources. For example, the probability that a car is a Japanese car given that it is economical on gas may be assigned a value, so that if we know that a car has low gas consumption, we can infer with some confidence that it is a Japanese car.

• overlap between information sources
This correlates the contents of the sources. It can be derived from the other two categories or stated explicitly. For example, the probability that a car contained in S1 is also contained in S2 may be 0.9, which is approximately equivalent to saying that S1 is a subset of S2.

This kind of probabilistic information can be very useful for optimizing query processing: the sources that have a higher probability of containing an answer to a query may be given priority in access. [VP98] includes a similar study on using probabilistic information in data integration systems.
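One simple way such probabilities could be used is sketched below in Python; the coverage and overlap numbers are invented for illustration, and the ranking policy is only one of many possibilities, not the method of [FKL97] or [VP98].

# Hypothetical coverage probabilities: P(a matching item is found in the source).
coverage = {"S1": 0.9, "S2": 0.6, "S3": 0.75}
# Hypothetical overlap: overlap[(a, b)] = P(an item of a is also found in b);
# 0.95 here means that S2 is (almost) a subset of S1.
overlap = {("S2", "S1"): 0.95}

def access_order(relevant_sources, skip_threshold=0.9):
    """Rank the qualitatively relevant sources by coverage; skip a source if an
    already selected source is believed to contain almost all of its answers."""
    chosen = []
    for s in sorted(relevant_sources, key=lambda s: coverage.get(s, 0.0), reverse=True):
        if any(overlap.get((s, c), 0.0) >= skip_threshold for c in chosen):
            continue          # s adds little beyond what is already planned
        chosen.append(s)
    return chosen

print(access_order(["S1", "S2", "S3"]))   # ['S1', 'S3'] -- S2 is skipped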

4.2 Query Reformulation

Using the source descriptions, a user query written in terms of the mediated schema is reformulated into a query that refers directly to the schemas of the sources (but is still expressed in the global data model). There are two important criteria to be met in query reformulation [Lev99a]:

• Semantic correctness of the reformulation: the answers obtained from the sources must be correct answers to the original query.

• Minimizing source access: sources that cannot contribute any answer, or partial answer, to the query should not be accessed. In addition to avoiding access to redundant sources, we should make the reformulated queries as specific as possible to each of the accessed sources, to avoid redundant query evaluation.

In this section, we mainly discuss query reformulation techniques for the LAV approach to source modeling. The reason is that query reformulation in LAV is not straightforward, and it is also one of the applications of an important problem called "Answering Queries using Views". In what follows, we first briefly summarize this problem together with its other important applications; we then present various query reformulation algorithms for LAV.


4.2.1 Answering Queries Using Views

Informally, the problem is defined as follows: given a query Q over a database schema and a set of view definitions V1, . . . , Vn over the same schema, rewrite the query using the views as Q′ such that the subgoals in Q′ refer only to view predicates. If we can find such a rewriting of Q into Q′, then to answer Q it is enough to answer Q′ using the answers of the views [Lev00].

Interpreted in terms of the query reformulation problem for the LAV approach, this means the following: using the views that describe the sources in terms of the mediated schema, we can answer a user query written in terms of that schema by rewriting it as another query that refers to the views rather than to the mediated schema itself. Each view referred to by the new query can then be evaluated at the corresponding source. Essentially, we decompose the query into several subqueries, each of which refers to a single source.

Answering queries using views has many other important applications, including query optimization, database design, data warehouse design and semantic data caching [Lev00]. For example, query optimization may use previously materialized views to answer a query in order to avoid recomputation. We discuss data warehouse design issues in Section 6.

The ideal rewriting would be an "equivalent" rewriting. However, this is not always possible. In data integration systems in particular, source incompleteness and limited source capabilities lead to rewritings that only approximate the original query. Among the many possible approximate rewritings, we need to find the "best" one; the technical term for this best rewriting is a "maximally-contained" rewriting. The definitions below formalize these terms [Lev00]:

Query Containment and Equivalence A query Q′ is contained in another query Q if, for all databases D, Q′(D) is a subset of Q(D). A query Q is equivalent to another query Q′ if Q′ and Q are contained in one another. (A sketch of a containment test for conjunctive queries follows these definitions.)

Equivalent Rewritings Let Q be a query and V = V1, . . . , Vm be a set of view definitions. The query Q′ is an equivalent rewriting of Q using V if:

• Q′ refers only to the views in V, and

• Q′ is equivalent to Q.

Maximally-contained Rewritings Let Q be a query and V = V1, . . . , Vm be a set of view definitions in a query language L. The query Q′ is a maximally-contained rewriting of Q using V with respect to L if:

• Q′ refers only to the views in V,

• Q′ is contained in Q, and

• there is no rewriting Q1 such that Q′ ⊆ Q1 ⊆ Q and Q1 is not equivalent to Q′.
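For conjunctive queries without comparison predicates, containment can be tested with the standard "canonical database" (frozen query) technique: freeze the body of Q′ into facts and check whether Q, evaluated over those facts, produces the frozen head of Q′. The Python sketch below illustrates this; the query representation and the example queries are our own and deliberately simplified.

from itertools import product

def is_var(t):
    return isinstance(t, str) and t.isupper()

def eval_query(head, body, facts):
    """All answers of a conjunctive query over a set of ground facts.
    head: tuple of variables/constants; body: list of (predicate, args) atoms;
    facts: iterable of (predicate, value1, value2, ...) tuples."""
    answers = set()
    per_subgoal = [[f[1:] for f in facts if f[0] == pred] for pred, _ in body]
    for combo in product(*per_subgoal):
        binding, ok = {}, True
        for (_, args), tup in zip(body, combo):
            for a, v in zip(args, tup):
                if is_var(a):
                    if binding.setdefault(a, v) != v:
                        ok = False
                elif a != v:
                    ok = False
        if ok:
            answers.add(tuple(binding.get(a, a) for a in head))
    return answers

def contained_in(q1, q2):
    """True if q1 is contained in q2 (conjunctive queries, no comparisons):
    freeze q1's variables into fresh constants, build the canonical database
    from q1's body, and check that q2 returns q1's frozen head over it."""
    (head1, body1), (head2, body2) = q1, q2
    freeze = lambda a: ("c_" + a.lower()) if is_var(a) else a
    canonical_db = [(pred,) + tuple(freeze(a) for a in args) for pred, args in body1]
    return tuple(freeze(a) for a in head1) in eval_query(head2, body2, canonical_db)

# Q(V)  :- CAR(V, S), MODEL(V, M, Y)          -- all cars
# Q1(V) :- CAR(V, "used"), MODEL(V, M, Y)     -- only used cars
Q  = (("V",), [("CAR", ("V", "S")), ("MODEL", ("V", "M", "Y"))])
Q1 = (("V",), [("CAR", ("V", "used")), ("MODEL", ("V", "M", "Y"))])
print(contained_in(Q1, Q), contained_in(Q, Q1))   # True False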


4.2.2 Completeness and Complexity of Finding Query Rewritings

Theoretical issues related to the problem of finding query rewritings using views include the completeness and the complexity of the query rewriting algorithms. We briefly touch on these issues here and refer the interested reader to [Lev00] for a detailed discussion.

Completeness of a query rewriting algorithm is defined as follows in [Lev00]: given a set of views V and a query Q, will the algorithm always find a rewriting of Q using V if such a rewriting exists? The answer depends in part on the query language used to express the rewritings; the limited expressiveness of the language may prevent the algorithm from finding a rewriting even though one exists. When no equivalent rewriting exists, we try to find a maximally-contained rewriting. [Lev00] also points out that recursive Datalog rules are sometimes needed to obtain a maximally-contained rewriting, which again illustrates how the algorithms depend on the expressiveness of the query language.

The complexity of query rewriting algorithms can be discussed under different language and modeling assumptions. In general, the problem is NP-complete; please refer to [Lev00] for a discussion of the specific cases.

4.2.3 Reformulation Algorithms

Given a query Q and a set of views V1 . . . Vn, to rewrite Q in terms of the Vi's we would, in principle, have to perform an exhaustive search over all possible conjunctions of m or fewer view atoms, where m is the number of subgoals in the query. The following algorithms propose alternative ways of finding query rewritings that avoid this exhaustive search.

The Bucket Algorithm (Information Manifold)
The main idea underlying the Bucket Algorithm [Lev00] is that we can reduce the number of candidate rewritings by considering each subgoal of the query separately and determining which views may be relevant to that subgoal. Given a query Q, the Bucket Algorithm finds a rewriting of Q in two steps:

1. The algorithm creates a bucket for each subgoal in Q, containing the views (i.e., data sources) that are relevant to answering that particular subgoal.

2. The algorithm tries to find query rewritings that are conjunctive queries, each consisting of one conjunct from every bucket. For each possible choice of an element from each bucket, the algorithm checks whether the resulting conjunction is contained in the query Q, or whether it can be made contained by adding predicates to the rewriting. If so, the rewriting is added to the answer. Hence, the result of the Bucket Algorithm is a union of conjunctive rewritings.


The following simple example shows how the algorithm works:
Example
Consider the car-dealer example presented earlier. Assume that there are three data sources S1, S2 and S3: S1 contains information about cars produced after 1990, S2 contains cars sold by the dealer named "ACME", and S3 contains car reviews. Assume that we have the following relations in the mediated schema:

CAR(vin, status)

MODEL(vin, model, year)

SELLS(dealer_name, vin, price)

REVIEW(vin, review)

Furthermore, we have the following view definitions for the data sources:

S1(vin, status, model, year) :- CAR(vin, status), MODEL(vin, model, year), year ≥ 1990

S2(vin, status, model, price) :- CAR(vin, status), MODEL(vin, model, year), SELLS(dealer_name, vin, price), dealer_name = "ACME"

S3(vin, review) :- REVIEW(vin, review)

Assume that we are looking for used cars produced before 1990, together with their reviews and where they are sold. We pose the following query to the mediated system:

Q(vin, dealer, review) :- CAR(vin, status), MODEL(vin, model, year), SELLS(dealer_name, vin, price), REVIEW(vin, review), year < 1990, status = "used"

We will use the initial letters of the attribute names for ease of presentation. The first step of the Bucket Algorithm constructs the following buckets, one per subgoal of Q:

Bucket for CAR(V, S): S2(V, S, M', P')
Bucket for MODEL(V, M, Y): S2(V, S', M, P')
Bucket for SELLS(D, V, P): S2(V, S', M', P)
Bucket for REVIEW(V, R): S3(V, R)

Notice how views are mapped to each query subgoal by the buckets. It is important to note that we did not insert S1 into the buckets of CAR(V, S) and MODEL(V, M, Y) because of the constraint on the year attribute in the query: since S1 contains cars produced after 1990 and the query asks for cars produced before 1990, S1 cannot contribute to the answer.

The second step of the algorithm chooses one view from each bucket and combines them into a new query. Since in this simple example we have only one entry per bucket, there is a single combination of views. In general, we would have to construct one query per possible combination of entries and test each for containment in the original query; the result would then be the union of all the contained queries.


We obtain the following new query, written in terms of the view definitions rather than the mediated schema relations:

Q'(vin, dealer, review) :- S2(vin, status, model, price), S3(vin, review), year < 1990, status = "used"

Note that we eliminated the two redundant references to view S2, and we also added the extra predicates on the year and status attributes, since without them Q' would not be contained in Q.

In terms of completeness and complexity, [Lev00] notes that the Bucket Algorithm is guaranteed to find a maximally-contained rewriting of a query if the query does not contain arithmetic comparison predicates; however, the second phase may take exponentially long. (A small sketch of the bucket-construction step is given below.)
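The following Python sketch shows only the bucket-construction step for the running example; it is a deliberate simplification that summarizes each view by the predicates it covers and by a numeric range on year, and it ignores variable mappings, head exports and the containment check of step 2.

from math import inf

# Hypothetical, highly simplified summaries of the views in the running example.
views = {
    "S1": {"preds": {"CAR", "MODEL"},          "year": (1990, inf)},   # year >= 1990
    "S2": {"preds": {"CAR", "MODEL", "SELLS"}, "year": (-inf, inf)},
    "S3": {"preds": {"REVIEW"},                "year": (-inf, inf)},
}
query_subgoals = ["CAR", "MODEL", "SELLS", "REVIEW"]
query_year = (-inf, 1990)                                              # year < 1990

def overlaps(a, b):
    """Do two (low, high) ranges intersect?  (Open/closed endpoints glossed over.)"""
    return max(a[0], b[0]) < min(a[1], b[1])

def build_buckets():
    """Step 1 of the Bucket Algorithm (simplified): a view enters the bucket of
    a query subgoal if it covers that predicate and its range constraint does
    not contradict the query's constraint."""
    return {g: [v for v, d in views.items()
                if g in d["preds"] and overlaps(d["year"], query_year)]
            for g in query_subgoals}

print(build_buckets())
# {'CAR': ['S2'], 'MODEL': ['S2'], 'SELLS': ['S2'], 'REVIEW': ['S3']}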

The Inverse-Rules Algorithm (InfoMaster)
The key idea underlying this algorithm is to construct a set of rules that invert the view definitions, i.e., rules that show how to compute tuples for the mediated schema relations from tuples of the views [Lev00]. One can think of this process as obtaining GAV definitions out of LAV definitions. In other words, we do not actually rewrite the query; instead, we rewrite the view definitions so that the original query can be answered directly over the rewritten rules.

One inverse rule is constructed for every subgoal in the body of the view. While inverting the view definitions, the existential variables that appear in the view definitions are replaced by Skolem functions, to ensure that the value equivalences between the variables are not lost. The following example illustrates the algorithm:

Example
Consider the view definition for S1 from the previous examples:

S1(vin, status, model, year) :- CAR(vin, status), MODEL(vin, model, year), year ≥ 1990

The Inverse-Rules Algorithm inverts this view definition by writing one inverse rule for every subgoal in its body:

CAR(vin, status) :- S1(vin, status, model, year)

MODEL(vin, model, year) :- S1(vin, status, model, year)

Since every variable in the body of S1's definition also appears in its head, no Skolem functions are needed for these rules. Skolem functions come into play when a body variable does not appear in the view head. For instance, in the LAV definition of S2 given earlier, the variable year appears in the MODEL subgoal but not in the head S2(vin, status, make, price); the corresponding inverse rule therefore replaces year with a Skolem term, giving MODEL(vin, make, f1(vin, status, make, price)) :- S2(vin, status, make, price). When such a hidden variable is shared by several subgoals, using the same Skolem term in each inverse rule makes sure that the subgoals are still mapped to the same (unknown) value.

The rewriting of a query Q using the set of views V is the Datalog program that consists of the inverse rules for V together with the query Q. Below we show how a query is evaluated using these rules.


Q(vin) :- CAR(vin, status), MODEL(vin, model, 2000)

Assume that the source defined by S1 contains the following data:

S1 = {(1, "used", "Honda", 2000), (2, "new", "Toyota", 2001), (3, "used", "Subaru", 2000)}

Then the algorithm would compute the following tuples:

{CAR(1, "used"), CAR(2, "new"), CAR(3, "used"),

MODEL(1, "Honda", 2000), MODEL(2, "Toyota", 2001), MODEL(3, "Subaru", 2000)}

When Q is evaluated on these tuples, we obtain the answer {1, 3}. In terms of completeness, the Inverse-Rules Algorithm is guaranteed to find a maximally-contained rewriting in time polynomial in the size of the query and the views [Lev00].

Note that this example also illustrates how the rules of the GAV approach can be used to evaluate queries.
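The following minimal Python sketch replays the example: the inverse rules turn S1 tuples into CAR and MODEL facts, and the query is then answered by a simple join. Where a Skolem term is needed, it could be represented as a tagged tuple, as noted in the final comment; this representation is our own choice for illustration.

# Source tuples for S1(vin, status, model, year), as in the example.
S1 = [(1, "used", "Honda", 2000), (2, "new", "Toyota", 2001), (3, "used", "Subaru", 2000)]

# Inverse rules for S1: every S1 tuple contributes one CAR and one MODEL fact.
CAR = {(vin, status) for (vin, status, model, year) in S1}
MODEL = {(vin, model, year) for (vin, status, model, year) in S1}

# Q(vin) :- CAR(vin, status), MODEL(vin, model, 2000)
answers = {vin for (vin, status) in CAR
           for (vin2, model, year) in MODEL
           if vin == vin2 and year == 2000}
print(sorted(answers))   # [1, 3]

# If a view hid an attribute (e.g. year did not appear in the view head), the
# inverse rule would introduce a Skolem term in its place, which could be
# represented here as a tagged tuple such as ("f1", vin, status, make, price).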

The MiniCon Algorithm
The MiniCon Algorithm is an improved version of the Bucket Algorithm. As in the Bucket Algorithm, there are two steps: computing the buckets, one for each subgoal in the query, and then computing the rewritings using the buckets. In addition, the MiniCon Algorithm pays attention to the interaction of the variables in the query and in the view definitions in order to prune some of the views before they are added to the buckets. This reduces the number of views to be considered in the rewriting step, i.e., there are fewer combinations to check.

The following example clarifies the algorithm.

Example
Consider the following view definitions and the query:

S1(model, year) :- CAR(vin, status), MODEL(vin, model, year)

S2(vin, status, model, price) :- CAR(vin, status), MODEL(vin, model, year), SELLS(dealer_name, vin, price)

S3(vin, review) :- REVIEW(vin, review)

Q(vin, dealer, review) :- CAR(vin, status), MODEL(vin, model, year), SELLS(dealer_name, vin, price), REVIEW(vin, review)

The original Bucket Algorithm would put S1 into the buckets of CAR and MODEL. However, S1 cannot be used in a rewriting of Q, for the following reason: Q requires joins with SELLS and REVIEW. To be usable, S1 would either have to contain SELLS and REVIEW itself, or it would have to export the appropriate variables in its head so that it could be joined with other views that contain SELLS and REVIEW; in particular, S1 would need to export the variable vin to join with the other views on their vin attributes. However, S1's head contains only model and year. Therefore, S1 cannot be used in the rewriting, and there is no need to put it in the buckets of the CAR and MODEL subgoals. S2 goes into the buckets of CAR, MODEL and SELLS, and S3 goes into the bucket of REVIEW. The new query is:

Q'(vin, dealer, review) :- S2(vin, status, model, price), S3(vin, review)

By pruning S1 in the bucket-construction step, we need to check fewer view combinations to obtain a rewriting Q′ that is contained in Q.

For a detailed discussion on the MiniCon Algorithm, please see [PL00].

The Shared-Variable-Bucket Algorithm
This algorithm, like the MiniCon Algorithm, aims to remedy the weak aspects of the Bucket Algorithm and obtain a more efficient algorithm. The idea is again to examine the shared variables and reduce the bucket contents, so that fewer view combinations have to be considered in the second phase of the algorithm. We do not describe this algorithm in detail; interested readers should see [Mit99].

The CoreCover Algorithm
This algorithm takes a closed-world assumption, where the views are materialized from the base/source relations [ALU01]. Among the possibly infinite number of rewritings, the aim is to find those that are guaranteed to produce an optimal physical plan, if any exists. Since the rewriting is aimed at query optimization, different cost models are also considered. One particular difference of this algorithm is that, contrary to the previous algorithms, it looks for equivalent rewritings rather than contained rewritings. We again direct interested readers to [ALU01] for a fuller discussion of the CoreCover Algorithm.

Comparison of the Algorithms
The CoreCover Algorithm is quite different from the other algorithms. First, all the other algorithms aim at finding a maximally-contained rewriting of the query, whereas the goal of the CoreCover Algorithm is to find an equivalent rewriting. Second, it takes a closed-world assumption, which is what enables it to find an equivalent rewriting. Third, the reformulation stage of query processing is effectively integrated with the optimization stage, since the rewriting has to guarantee an optimal plan for the query.

Of the remaining four algorithms, the Bucket, MiniCon and Shared-Variable-Bucket Algorithms share the same spirit, in that buckets are constructed and the cartesian product of the buckets is then taken to produce the rewritings. The deficiency of the Bucket Algorithm is that the constructed buckets are unnecessarily large, which causes many combinations to be computed and tested in the second phase. The MiniCon and Shared-Variable-Bucket Algorithms use very similar approaches to prevent this deficiency. The MiniCon Algorithm has been shown to outperform both the Bucket and the Inverse-Rules Algorithms [PL00].

Finally, the Inverse-Rules Algorithm has the advantage of being query-independent: the rules are computed once and then apply to any subsequent query. The rules are also easy to extend when additional constraints, such as functional dependencies, are added to the system [Lev00]. On the other hand, the rewriting obtained by the Inverse-Rules Algorithm may contain views that are not relevant to the query, because the algorithm ignores the predicates that impose constraints on the variables. An additional phase that removes the irrelevant views can be added, but this has been shown to be very inefficient [Lev99b]. The Inverse-Rules Algorithm may also have to consider a large number of rule unfoldings during query evaluation.

4.3 Query Optimization and Execution

Query optimization is the process of translating a declarative query into an efficient query execution plan, i.e., a specific sequence of steps that the query execution engine should follow to evaluate the query. In addition to the operators and their application order, the optimizer has to decide on the specific algorithms that implement the operators and on which indices to use with them. There may be many possible execution plans. The best execution plan can be chosen in two ways: cost-based or heuristics-based. In the cost-based approach, the optimizer estimates the costs of candidate plans and chooses the cheapest of them; the cost estimates rely on statistical information about the underlying data, such as the sizes of the relations and the selectivity of predicates. Heuristics-based plan generation uses rules of thumb, such as performing selections before joins. The heuristics-based technique is usually easier and cheaper than the cost-based one, because it does not need to enumerate and cost all possible plans; however, it does not guarantee an optimal plan.

As discussed in the previous section, the query reformulation step already provides some optimizations by pruning irrelevant sources and identifying overlapping sources to avoid redundant computation; furthermore, the rewritten queries are made as specific as possible. However, these are logical, higher-level optimizations. There are still many optimizations to be done when the logical plan generated by the reformulator is actually executed against the data.

Query optimization in data integration systems is more difficult than the optimization problem in traditional databases because:

• Sources are autonomous. The optimizer may have no statistics about the data stored in each of the sources, or only few and unreliable ones.

• Sources are heterogeneous. They may have different query processing capabilities, and the optimizer needs to exploit these capabilities as much as it can. Besides what kinds of queries the sources can process and how they process them, it also matters what processing power underlies their data management systems and how their performance changes with workload.

• In traditional databases, it is easy to estimate the data transfer time, since transfers happen between the local disk and main memory. In data integration systems, data transfer time is unpredictable because of the network environment; both delays and bursts may occur.

• On the one hand, the sources overlap and there is redundancy most of the time, which is why access to redundant sources should be minimized. On the other hand, some sources may become unavailable without notice. The query optimizer should handle these cases flexibly by substituting overlapping sources for one another to compensate for the unavailability of any of them.

An additional problem that may cause inefficient query execution is that the logical plan produced by the reformulator tends to contain many disjunctions, i.e., union operations.

The bottom line is that, due to insufficient information and the dynamicity of the environment, it is difficult to decide statically what the optimal strategy for executing a query would be. Therefore, the traditional approach of first generating a query execution plan and then executing it is no longer applicable. [IFF+99] proposes an adaptive query execution approach in which query optimization and execution are interleaved. In the rest of this section we mainly discuss this approach.

4.3.1 Adaptive Query Execution*

[Note: the asterisk indicates that this is an advanced section and optional for the reader.]

In addition to the problems listed above, [IFF+99] makes the following observations about query optimization in data integration systems:

• It is more important to minimize the time to obtain the first answers to the query than to minimize the total amount of work needed to execute the whole query.

• The amount of data coming from the data sources is usually smaller than in the case of querying a single source, as in traditional database systems.

Adaptivity in [IFF+99] exists at two levels:

• interleaved planning and execution


• adaptive operators for execution engine

At the higher level, the former is achieved by creating partial plans, called fragments, rather than complete plans. The optimizer decides how to proceed only after executing a fragment. Once a fragment is completed, the optimizer knows more about the sources and the environment and can plan the rest of the query better.

The latter involves new operators used during execution, motivated by the observations listed above. Two important operators used in the Tukwila System described in [IFF+99] are the double-pipelined hash join and the collector operator.

The double-pipelined hash join is a join implementation that allows Tukwila to return the first answers to the query quickly, even when some sources respond very slowly. In contrast to the conventional hash join, where the smaller of the two relations to be joined is chosen as the inner relation and hashed on the join attribute, in the double-pipelined hash join both relations are hashed. Result tuples are thus produced as soon as the data from the sources arrives, which masks the slow transmission rates of some sources. The optimizer no longer has to decide which relation should be the inner one (normally it would need to know the relation sizes to pick the smaller one as the inner), and processing is not blocked by delays at the sources.
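The idea behind the double-pipelined (symmetric) hash join can be sketched as follows in Python; the interleaved arrival stream stands in for tuples trickling in from two slow sources, and this is only an illustration of the principle, not the Tukwila implementation.

def symmetric_hash_join(arrivals):
    """arrivals yields ('L', (key, payload)) or ('R', (key, payload)) in any order.
    Both inputs are hashed; each arriving tuple first probes the other side's
    table, so results are produced as soon as matching tuples have arrived."""
    tables = {"L": {}, "R": {}}
    for side, (key, payload) in arrivals:
        other = "R" if side == "L" else "L"
        # Probe the other side's hash table: emit any joins found so far.
        for other_payload in tables[other].get(key, []):
            yield (key, payload, other_payload) if side == "L" else (key, other_payload, payload)
        # Then insert the new tuple into this side's hash table.
        tables[side].setdefault(key, []).append(payload)

# Tuples from two interleaved sources, keyed on vin -- hypothetical data.
stream = [("L", (1, "used")), ("R", (2, "3.5k$")), ("L", (2, "new")), ("R", (1, "2.7k$"))]
for result in symmetric_hash_join(stream):
    print(result)          # (2, 'new', '3.5k$') and then (1, 'used', '2.7k$')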

The collector operator facilitates unions over large numbers of overlapping sources. Using the estimates of the overlap relationships between the sources, and depending on their run-time behavior (delays, errors), the optimizer adapts its policy on how the unions should be performed, and the collector operator applies this dynamic policy. Policies are specified using rules.

Both levels of adaptivity are realized through event-condition-action rules. Events are raised by the execution of operators, by the completion of fragments, or by the arrival of partial results. When an event triggers a rule, the associated condition is checked first; if it is true, the defined action is executed. Possible actions include reordering operators, re-optimization, changing the policy of the collector operator, and so on. The rules accompany the operator tree generated by the optimizer. They specify how to modify the implementation of some operators (for example, the collector) at run time if needed, and which conditions to check at the points where fragments complete, in order to detect opportunities for re-optimization.

4.3.2 Query Translation

One thing we have treated as a black box until now is how the source queries in the exported schemas (i.e., in the schemas of the sources but in the global data model) are actually translated into the sources' own schemas (in their local data models) and executed by their native query processors. This step is called query translation, and it is performed by the source-specific wrappers. Data extraction from the sources by the wrappers is the topic of the next section.


5 Techniques for Extracting Data: Wrappers

Data extraction deals with the issues that arise while getting data from the different sources to the integration system. It combines techniques from database systems and artificial intelligence (such as natural language processing and machine learning). In this section, we first discuss wrappers, which, as we have seen in the previous sections, are important components of a data integration system. We then review some work on tools for semi-automatic and automatic wrapper generation.

During information integration from heterogeneous data sources, we have to translate queries and data from one data model to another and from one data schema to another. As mentioned in Section 2, this is done by wrappers, one written for each data source in the integration system. Each wrapper transforms queries over the unified schema into queries in the format of the underlying data source and then translates the results back to the unified schema. We note that mediator systems usually require more complex wrappers than most warehouse systems do.

5.1 Wrapper Generation Approaches

Wrapper designers can either construct wrappers manually or use tools that facilitate wrapper code development. Three approaches are usually considered:

• Manual
Hard-coded wrappers are often tedious to create and may be impractical for some sources. In the case of web sources, for example, the number of sources can be very large, new sources are added frequently, and both the structure and the content of any source may change [AK97]. All these factors lead to high maintenance costs for manually generated wrappers.

• Semi-automatic (interactive)
It was noted in [HGMN+97] that the part of the wrapper code dealing with the details specific to a particular source is often small. The other part is either the same across wrappers or can be generated semi-automatically from a declarative description given by a user. Techniques such as programming by example can be used for this purpose.

• Automatic
Automatic generation means that there is no human involvement. Tools for automatic wrapper generation can be site-specific or generic. They usually need training in an initial stage and are based on supervised learning algorithms.


5.2 Tools for Semi-automatic/Automatic Wrapper Construction

Here we review several techniques for semi-automatic and automatic wrapper generation for structured/semistructured data. Most of them are designed for web sources. As mentioned above, writing wrapper code for web sources can be especially hard because of the frequent changes in the content and structure of the sources. On the other hand, data on the web often has a partial structure, which makes it possible to develop tools that extract this data automatically.

5.2.1 Using Formatting Information in the Semistructured Pages on the Web

HTML documents often have some internal hierarchy of information, but this hierarchy is not specified explicitly. For example, the site of a travel agency may have information about several countries and hotels in a semistructured format. Some records, such as "capital", "money units" or "language", appear for all countries, while others, like "states", are country-specific. The presence of partial structure in many web sources gives an integration system designer the opportunity to generate wrappers for the sources of a particular domain semi-automatically. Often this "semistructured" information comes to the web from the databases underlying the web sources, which raises the question of why we cannot query these databases directly. Unfortunately, for a number of reasons, a source may simply not grant outside users permission to query it.

The approach we describe was proposed by [AK97] and is used to semi-automatically generate wrappers for both multiple-instance sources and single-instance sources. A multiple-instance source contains information on several pages that all have the same format. An example is CNN's weather pages7: the pages for all cities have the same structure (for instance, there is always a Current Conditions section with temperature, humidity and wind). The wrapper must be able to answer queries about all sections of an individual page. A single-instance source contains a single page with semistructured information.

The authors identify three steps in the wrapper generation process for these types of sources: "structuring the source; building a parser; adding communication capabilities between sources, a wrapper, and a mediator" [AK97].

1. Structuring the source

The first step refers to finding heading tokens on a page, such as "Current Conditions", "Temperature" and "Wind", and organizing them in a hierarchy tree. Such sections are usually emphasized in the document by the font size (big), the font type (bold, italic), a colon following the token, and so on. All these simple heuristics, used by the authors, proved to work well for the domains they specified. After the system has suggested a set of headings, the user may intervene by correcting the output. The hierarchy of the found headings is determined based on indentation and font size. The result of this step is a grammar describing the structure of the pages of a web source. Results published by the authors show that usually only a few corrections by the user are needed per web source.

7 http://www.cnn.com/WEATHER/

2. Building a parser

A parser for extracting any structured portion of the data can be generated automatically, given the grammar output by the first step.

3. Adding communication capabilities

First, a wrapper needs a mechanism for fetching the appropriate pages from the sources. In the case of a single page per source, this is not a problem as long as the URL of the page is known. In the case of multiple pages per source, we need to map a query to a URL or a set of URLs. For the CNN weather site, for example, we can specify that, for a given state in the USA and a city in it, the URL of the page containing the weather forecast is obtained by appending the following to the end of the "http://www.cnn.com/WEATHER/" string:
- the abbreviation of the region (for instance, ne stands for the north east);
- the abbreviation of the state;
- the name of the city and a 3-letter city abbreviation.
For example, the URL for Providence, RI is http://www.cnn.com/WEATHER/ne/RI/ProvidencePWD.html (a small sketch of this URL construction is given below). Second, a wrapper relies on some protocol to deliver data over the network; the authors of [AK97] used Perl scripts. Third, the wrapper and the mediator need to communicate with each other within the integrated system; in the reviewed system [AK97], KQML (Knowledge Query and Manipulation Language) was used for this.
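The page-fetching part of such a wrapper might look roughly like the Python sketch below; the URL layout follows the example quoted in the text, while the function and parameter names are our own.

BASE = "http://www.cnn.com/WEATHER/"

def weather_url(region_abbr, state_abbr, city, city_abbr):
    """Map a (region, state, city) query to the page URL, e.g. for
    Providence, RI: region 'ne', state 'RI', city 'Providence', abbr 'PWD'."""
    return f"{BASE}{region_abbr}/{state_abbr}/{city}{city_abbr}.html"

print(weather_url("ne", "RI", "Providence", "PWD"))
# http://www.cnn.com/WEATHER/ne/RI/ProvidencePWD.html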

5.2.2 Template-based Wrappers

The approach proposed by Hammer et al. [HGMN+97] is applicable to several types of data sources: relational databases, legacy systems, and web sources. Their wrapper implementation toolkit is based on the idea of template-based translation. The designer of a wrapper uses a declarative language to define rules (templates) that specify the types of queries handled by a particular source. For each rule, the designer also defines an action to be taken when a query sent by the mediator of the integration system matches that rule. This action causes a native query, i.e., a query in the format of the underlying source, to be executed.

Filter queries are used to extend the set of queries a source can handle. If a source does not support some predicates, the query is turned into two queries: the native query (containing only those predicates that are supported by the source) and a filter query that postprocesses the results of the native query.

The query transformation process consists of the following steps. First, a query from the mediator is parsed; then it is matched against the templates in the system. If a matching rule is found, the native query is processed by the data source, and the result is filtered by the wrapper using the filter query.

The rule-based language MSL [PGMA96] is used by the authors for query formulation. Below we give an example of an MSL query, a template matching it, and the corresponding native and filter queries. For the purposes of the example (which is based on the example presented in [HGMN+97]), the data source is a relational database.
Example
Let us refer again to the Acme Cars company, which has the relation

Automobiles(vin, autoMake, autoModel, autoPrice, autoYear).

This relational database, consisting of just one relation, is our data source. We need to write a wrapper supporting MSL queries to this source. We further assume that the source does not support comparison predicates on the autoPrice attribute. Let a user ask the following MSL query for all Honda cars for sale whose price is less than $12,000:

C :-- C:<Automobiles{<autoMake "Honda"> <autoPrice P>}>
AND LessThan(P, 12,000)

One of the templates matching this query is:

C :-- C:<Automobiles{<autoMake $A >}>

Notice that the result of this template query is a superset of the results requested by the user query. The action corresponding to this template is to select all automobiles with autoMake = $A. In the system, $A is substituted with "Honda" and a native SQL query for the relational database is produced:

SELECT *
FROM Automobiles
WHERE autoMake = "Honda"

After the wrapper receives the answer to this SQL query from the source, the only remaining step is to postprocess its results using the following filter query:

C :-- C:<Automobiles{<autoPrice P>}>
AND LessThan(P, 12,000)


This leaves only those Honda cars whose price is less than $12,000. After that, the result can be returned to the mediator of the integration system.
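A very rough Python sketch of this native-query-plus-filter pattern is given below; the in-memory table, the SQL string and the function names are invented for illustration and do not reflect the actual MSL toolkit machinery.

# Hypothetical in-memory stand-in for the Automobiles source relation.
AUTOMOBILES = [
    {"vin": 1, "autoMake": "Honda", "autoModel": "Civic", "autoPrice": 11000, "autoYear": 1998},
    {"vin": 2, "autoMake": "Honda", "autoModel": "Accord", "autoPrice": 15000, "autoYear": 2000},
    {"vin": 3, "autoMake": "Toyota", "autoModel": "Corolla", "autoPrice": 9000, "autoYear": 1997},
]

def answer_query(make, max_price):
    """Template action: the source supports equality on autoMake, so that part
    becomes the 'native' query; the unsupported price comparison is applied as
    a filter query over the native result."""
    native_sql = f"SELECT * FROM Automobiles WHERE autoMake = '{make}'"
    print("native query:", native_sql)
    native_result = [row for row in AUTOMOBILES if row["autoMake"] == make]   # simulate the source
    return [row for row in native_result if row["autoPrice"] < max_price]     # filter query

for row in answer_query("Honda", 12000):
    print(row)    # only the $11,000 Honda Civic survives the filter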

5.2.3 Inductive Learning Techniques for Automatically Learning a Wrapper

These techniques are sometimes called wrapper induction techniques [Eik99] and are based on inductive learning. According to [Eik99], inductive learning is the task of computing a generalization from a set of examples of some unknown concept; the generalization should suggest a model explaining all of the examples.

A very simple example of inductive inference is a teacher reading out a sequence of numbers, say 2, 4, 6, 8, 10, and asking a pupil to guess the rule used to produce the next number from the previous one (in this case pn+1 = pn + 2).

[Eik99] points out the following classification of the inductive learning methods used for wrapper induction:

• Zero-order learning
These are also called decision tree learners, as their solutions are represented as decision trees. Their drawback stems from the fact that they are based on propositional logic, which has a number of limitations; for example, they cannot deal with several relations in a relational database [Eik99].

• First-order learning
Methods of this type can deal with first-order logical predicates. Inductive logic programming is a widely used method of this class, owing to its ability to deal with complex structures such as recursion. Two approaches, bottom-up and top-down, are often used as part of first-order learning.

The bottom-up approach first suggests a generalization based on a few examples; this generalized model is then corrected based on the other examples. The top-down approach starts with a very general hypothesis and then refines it by learning from negative examples.

Some known systems for inductive learning of wrappers are STALKER [MMK98] and the system described by Kushmerick et al. [KWD97]. An excellent overview of other systems for information extraction by inductive learning is given in [Eik99].

Here we discuss the system developed by Kushmerick et al. [KWD97]. This approach is suited to text sources (HTML pages) with a tabulated structure and with the following delimiters: head, left, right and tail.
Example
Let us consider an HTML page listing the pupils in a class with their corresponding GPAs (this example is based on the example in [KWD97]). For simplicity, we assume that each pupil is identified only by a last name.


<HTML><TITLE>Current GPA of the students</TITLE>
<BODY><B>Current GPA of the students</B><P>
<B>Simpson</B> <I>2.72</I><BR>
<B>Johnson</B> <I>3.5</I><BR>
<B>Peterson</B> <I>4.0</I><BR>
<HR><B>End</B></BODY></HTML>

<B> and </B> (as well as <I> and </I>) are called left and right delimiters and separate the data on the HTML page. However, the first and the last strings also contain these delimiters, so in order to distinguish the tuples from the heading/ending of the HTML page, a head tag <P> and a tail tag <HR> can be used [KWD97]. This set of delimiters makes the wrapper's job simple: first skip everything up to and including <P>; then, until the end marker <HR> is reached, fill in tuples with the data surrounded by (<B>, </B>) and (<I>, </I>).

The algorithm the authors developed automatically generates wrappers for HTML sites with such a structure. Missing data is not allowed, and the attribute order is fixed. They call this class of wrappers HLRT wrappers (head, left, right and tail).

First, a number of HTML pages are labeled. Labeling here means specifying the tuples contained in the HTML page; in the example above, we would label the page with {(Simpson, 2.72), (Johnson, 3.5), (Peterson, 4.0)}. The hypothesis in this case is the set of tags used to separate each attribute (in the example, the wrapper should learn that <B> and </B> delimit family names and that <I> and </I> delimit GPAs). Labeling by hand is fairly laborious, so the authors describe how a set of domain-specific heuristics can be used as input to a labeling algorithm to semi-automate it.

The induction algorithm learns from this labeled data. Iteratively, the algorithm constructs a set of delimiters consistent with the labeled pages, by considering possible combinations of the tags present in the pages until a consistent set of tags is found. The question is: how many examples are enough to conclude that a set of delimiters that is consistent with all examples seen so far will be consistent with the remaining examples? More formally, we need to know how many examples are enough to say that with probability ε we have learned a wrapper correctly with confidence δ. The authors provide the formulas they used to estimate this.
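Once the delimiters are learned, the extraction step of an HLRT wrapper is straightforward; the following Python sketch applies the delimiters of the example page above (it assumes well-formed pages and hard-codes the delimiter strings for illustration, whereas a real HLRT wrapper learns them).

PAGE = ('<HTML><TITLE>Current GPA of the students</TITLE><BODY>'
        '<B>Current GPA of the students</B><P>'
        '<B>Simpson</B> <I>2.72</I><BR>'
        '<B>Johnson</B> <I>3.5</I><BR>'
        '<B>Peterson</B> <I>4.0</I><BR>'
        '<HR><B>End</B></BODY></HTML>')

def hlrt_extract(page, head="<P>", tail="<HR>",
                 delims=(("<B>", "</B>"), ("<I>", "</I>"))):
    """HLRT extraction: skip everything up to the head delimiter, stop at the
    tail delimiter, and inside that region repeatedly read one attribute per
    (left, right) delimiter pair to build tuples."""
    body = page[page.index(head) + len(head): page.index(tail)]
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            start = body.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)
            end = body.index(right, start)
            row.append(body[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

print(hlrt_extract(PAGE))
# [('Simpson', '2.72'), ('Johnson', '3.5'), ('Peterson', '4.0')]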


6 Materialized View Management

A view defines a derived relation over a set of database relations. It is essentially a query whose result is given a name, which can then be used like the names of the ordinary relations stored in the database. When the tuples of the virtual relation defined by a view are physically stored in a database, we call the view a materialized view.

The use of materialized views dates back to the early 1980s [GM99]. They were first proposed as a tool to speed up queries on views, and were later also used to maintain integrity constraints and to detect rule violations in active databases. They gained serious reconsideration with the emergence of new applications such as data warehousing. In this section, we discuss issues related to the use of materialized views in data integration systems.

We presented the materialized view approach to data integration in Section 2. The term comes from the fact that a set of views is derived from the data sources and the answers to those views are actually stored in a repository called a data warehouse. The main purpose of this pre-computation is to improve query response time: none of the complex query processing steps described in Section 4 is needed to answer a user query at the time it is asked, since those steps have been completed and the results collected in advance of the queries. This not only allows many queries to be answered very quickly, but also increases the availability of the system, since the warehouse continues to answer queries even if the underlying data sources become inaccessible at query time.

Of course, the benefits mentioned above do not come for free. First, the views to be materialized need to be determined. It is usually both costly and redundant to materialize all the derived relations defined by the views that constitute the unified schema of the integrated system; the most beneficial views have to be selected, based on criteria such as the most frequently asked queries. Second, and more importantly, the views selected for materialization need to be maintained. View maintenance is the process of synchronizing the derived data stored at the warehouse with the updates on the base data, i.e., the data stored at the underlying data sources. The naive way of maintaining views would be to re-materialize them whenever a relevant update occurs; however, this is undesirable for performance reasons.

In this section, we first present the materialized view selection problem together with proposed solutions. We then discuss various approaches for maintaining the selected views efficiently.

6.1 Design and Selection of Views to Materialize

The materialized view selection problem can be defined as follows: given a set of queries to the integrated system with their access frequencies, and a set of source relations with their update frequencies, find a set of views to be materialized such that the total query response time (i.e., query processing time) and the cost of maintaining the selected views are minimized [YKL97]. There may also be other resource constraints to consider, such as disk space, but the most important factor is the maintenance cost/time.

Previous research on this problem has built on Multiple-Query Optimization (MQO) techniques. MQO is the problem of finding an optimal query execution plan for evaluating a set of queries simultaneously. The techniques used involve identifying common subexpressions among the queries, executing them once and reusing the results later. In general, there may be many possible plans for each query and many possible ways of combining them, so the search space is very large. Two general approaches are: (i) producing locally optimal plans for each query and then merging them, which does not guarantee an optimal solution, and (ii) generating a globally optimal plan, which has a larger search space. [SG90] has proven that the MQO problem is NP-complete. Proposed solutions usually use heuristics to find a solution as close as possible to the optimal one. The related work on materialized view selection follows a similar path.

[YKL97] proposes a method in which a Multiple View Processing Plan (MVPP) is constructed from the set of queries, and some parts of this plan are then selected for materialization. The cost comparisons are based on the following measures: the cost of a query is the number of rows in the table used to construct that query; the cost of query processing is the frequency of the query multiplied by the cost of accessing the query from materialized nodes; the cost of view maintenance equals the cost of constructing the view, i.e., re-materialization is assumed; and the total cost is the sum of the query processing cost and the view maintenance cost.

There are two stages to view selection:

1. finding a good MVPP
The MVPP is the global query execution plan in which the local execution plans for individual queries are merged based on shared operations over common sets [YKL97]. There are two ways of finding the MVPP:

• merging locally optimal query plans
Locally optimal plans are computed for each query. The queries are then ordered in descending order of their query processing costs multiplied by their access frequencies. If there are k queries, k MVPPs are constructed as follows:

for i = 1 to k do
    take the ith local query plan and
    incorporate all the others into it in order

The view selection algorithm of stage 2 is then run on these k MVPPs and the least costly one is chosen. This approach takes time linear in the number of queries.

• generating a globally optimal plan
Rather than the locally optimal plans, all possible plans for each query are considered. The problem is mapped to a 0-1 integer linear programming problem stated as follows: select a subset of the join plan trees such that all queries can be executed and the total query processing cost is minimal [YKL97]. The set of join trees found is then used to construct the MVPP. The solution of the linear programming problem is optimal; however, solving it takes time exponential in the number of queries, so usually a near-optimal solution is found instead.

2. selecting views to materialize from the MVPP
An execution tree is built for the given MVPP, whose nodes correspond to intermediate results of the queries. We could simply choose the complete tree or all the leaf nodes for materialization; these correspond to materializing all the queries and all the base relations, respectively. The aim, however, is to find a set of intermediate nodes to materialize such that the total cost of query processing and view maintenance is minimized. The brute-force way of finding this set is to compare the cost of every possible combination of nodes, which is not efficient, so heuristics are used. The algorithm presented in [YKL97] is based on the following idea: whenever a new node is considered for materialization, we calculate the saving it brings in accessing all the queries involved, minus the cost of maintaining this node; if the value is positive, the node is materialized and added to the solution set (a small sketch of this greedy selection follows the list).
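The greedy selection step can be sketched as follows in Python; the candidate nodes and all the cost numbers are invented, and, unlike the real algorithm, this sketch treats each node's saving as fixed rather than dependent on the nodes already materialized.

# Hypothetical candidate intermediate nodes of an MVPP, each with an estimated
# saving in total query-processing cost if materialized and an estimated
# maintenance cost.  The rule from the text: materialize a node if its net
# benefit (saving minus maintenance cost) is positive.
candidates = [
    {"node": "CAR_join_MODEL", "saving": 120.0, "maintenance": 40.0},
    {"node": "SELLS_by_dealer", "saving": 35.0, "maintenance": 50.0},
    {"node": "cheap_used_cars", "saving": 80.0, "maintenance": 10.0},
]

def select_views(candidates):
    chosen = []
    for c in sorted(candidates, key=lambda c: c["saving"] - c["maintenance"], reverse=True):
        if c["saving"] - c["maintenance"] > 0:
            chosen.append(c["node"])
    return chosen

print(select_views(candidates))   # ['CAR_join_MODEL', 'cheap_used_cars']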

A somewhat similar approach is presented in [Gup97] which is based on usinggreedy heuristics and AND-OR graphs. An AND-OR graph represents a set ofquery plans. AND-OR graphs of the queries are merged to obtain an AND-ORview graph. Each node in the AND-OR view graph represents a view that couldbe selected for materialization. The problem is to choose among the nodes ofthe AND-OR view graph such that sum of total query response time and totalmaintenance time is minimized. [Gup97] states that the minimum set coverproblem can be reduced to this problem and it is NP-hard. A near-optimalalgorithm is presented which uses greedy heuristics. The set of the selectedviews has a benefit and at each step views that would increase the benefit ofthis set would be added to the set. Special cases of AND view graphs, OR viewgraphs, view graphs with indices are also investigated in [Gup97].

[RSS96] and [MRRS00], which mainly focus on the view maintenance problem, indirectly cover some methods that are applicable to view selection. [RSS96] proposes to augment a given set of materialized views with an additional set of views that may reduce the total maintenance cost; the selection problem here is to determine the additional views. [MRRS00] applies MQO techniques both to view selection and to maintenance. Selection comes into play where additional views are to be materialized temporarily for efficient maintenance, and the claim is that the same techniques are also applicable to the selection of permanent views to materialize.

There are also research studies in materialized view selection for the special case of data cubes [HRU96] and multidimensional datasets [SDN98] in OLAP. We do not present them in detail here.


6.2 The Problem of View Maintenance

Materialized views are derived from data originally stored at multiple data sources. As the primary copies of data at the sources get updated, materialized views become stale, i.e., inconsistent with the underlying data. We call the process of bringing the materialized views up-to-date with the changes in the underlying data view maintenance. A materialized view can always be brought up-to-date by re-evaluating the view definition. However, recomputing the views every time the base data changes is not very efficient. Moreover, [GM95] points out that in general only a part of the view changes in response to changes in the base relations, which is called the heuristic of inertia. Thus, only the parts of the views that are affected by the changes need to be recomputed and updated. This is called incremental view maintenance, and the following example illustrates it.

Example. Consider the following base relation stored at some data source:

Cars(vin, status, model, year, price)

Assume that the following view defined on Cars is materialized at the data warehouse:

CheapCars(vin, price) :- Cars(vin, status, model, year, price), price ≤ 3000

When a new car tuple <471, "used", "Mazda", 1992, 2500> is added into the Cars table at the data source, the CheapCars view needs to be updated. The only modification needed is the addition of the tuple <471, 2500>; the whole view need not be recomputed. When another car tuple <839, "used", "LandRover", 1996, 15000> is added, however, CheapCars does not need any modification, because the new tuple does not satisfy the view definition.
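
A minimal sketch of this incremental behavior for the CheapCars view is given below: only the inserted tuple is tested against the view predicate, and the materialized view is touched only when the predicate holds. The tuple layout follows the example above; the in-memory representation and function names are our own illustration.

from typing import Dict, Tuple

# Tuple layout follows the Cars relation above: (vin, status, model, year, price)
CarTuple = Tuple[int, str, str, int, int]

def cheap_cars_predicate(car: CarTuple) -> bool:
    """View condition of CheapCars: price <= 3000."""
    return car[4] <= 3000

def maintain_on_insert(view: Dict[int, int], inserted: CarTuple) -> None:
    """Incremental maintenance: test only the new tuple, never recompute
    the whole view."""
    if cheap_cars_predicate(inserted):
        vin, price = inserted[0], inserted[4]
        view[vin] = price          # add <vin, price> to CheapCars

# Usage mirroring the example: the Mazda is added, the Land Rover is ignored.
cheap_cars: Dict[int, int] = {}
maintain_on_insert(cheap_cars, (471, "used", "Mazda", 1992, 2500))
maintain_on_insert(cheap_cars, (839, "used", "LandRover", 1996, 15000))
print(cheap_cars)   # {471: 2500}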

In this subsection, we present the dimensions of the problem and alternative policies for view maintenance. Incremental view maintenance techniques will be discussed in the following subsections.

6.2.1 Dimensions of the Problem

The following parameters determine the complexity of the view maintenance problem [GM99]:

• Available Information

  This refers to the amount of information available to the view maintenance algorithm. The view definition and the actual update that occurred on the base data have to be known to the algorithm. In addition, information such as the contents of the materialized views, the contents of the base relations, the definitions of other views, and integrity constraints at the data sources might also be accessible to it. Depending on how much information is available, the task of view maintenance may be facilitated. For example, if we knew that a certain attribute is a key at the underlying data source, then we would also know that every insertion has a different value for that attribute; hence, an insertion at the source would require an insertion at any materialized view that refers to that attribute.

• Allowable Modifications

  This determines which modifications can be handled by the view maintenance algorithm, given the other parameters. These might include insertions, deletions, updates, group updates, etc.

• Expressiveness of the View Definition Language

  The view definition language may also facilitate or complicate the task of view maintenance. Views might be defined at various levels of expressiveness, through languages including conjunctive queries, aggregation, recursion, negation, etc.

• Database and Modification Instance

  The current contents of the data sources or the materialized views, together with the particular modification, may also determine the capabilities of the maintenance algorithm.

• Complexity

  A dimension somewhat at a different level than the others is complexity, which refers to the efficiency of the view maintenance algorithm. It can be measured along several sub-dimensions, including the complexity of the view maintenance language, of the view maintenance algorithm, and the amount of extra information needed.

6.2.2 View Maintenance Policies

There are two main steps in materialized view maintenance: propagate and refresh [GM99]. The propagate step computes the changes to be made to the materialized views in response to changes in the base data; in the refresh step the computed changes are actually applied to the materialized views. The propagate step always precedes the refresh step. The decision of when to perform the refresh step is called a view maintenance policy. Maintenance policies can be categorized as follows:

• Immediate View Maintenance

  Refreshing is done within the transaction that changes the base data. The advantages of this policy are that queries are processed fast and always return up-to-date results, because the materialized views are brought up-to-date before the queries arrive. On the other hand, this policy slows down the transactions at the data sources, since propagation and refreshing are done within the transaction's scope. Moreover, the policy may not be applicable at all when the data sources are fully autonomous and their commit decisions cannot be delayed by the integrated system.

• Deferred View Maintenance

  Refreshing of the views is done later than the transaction that changes the base data. A log of changes to the base data must be kept. This policy allows batch updates, since all the changes collected in the log can be applied to the views at the same time. There are three deferred view maintenance policies:

– Views are refreshed lazily, at query time. It is guaranteed that query answers will be consistent with the base data, and this is done without slowing down the transactions at the sources. However, queries to the integrated system are processed more slowly.

– Changes to the views are forced after a certain amount of change to the base data has accumulated. Both transaction and query performance are good, but queries may return out-of-date results.

– Refreshing is done periodically, at fixed time intervals. This is also called snapshot maintenance. Again, in spite of the good transaction and query performance, queries may return out-of-date results.

In general, immediate maintenance does not scale with the number of materialized views, but deferred maintenance does. Therefore the decision of which views to maintain immediately has to be made very selectively. If real-time queries are asked on a view for which consistent results are crucial, then that view should be maintained immediately. Views that are queried relatively infrequently can be maintained in a deferred fashion. Decision support applications, where a stable copy of the derived data is more important than freshness, usually use the periodic deferred policy. [CKL+97] provides a thorough study of consistency and performance issues in supporting multiple view maintenance policies. Materialized views that are related to each other may become inconsistent if they are maintained under different policies, so mutual consistency between the views has to be ensured.

The next question is how view maintenance is actually performed, which we discuss in the following subsections.

6.3 Incremental View Maintenance

Incremental view maintenance algorithms have been investigated for a long time as an efficient alternative to re-materialization. Most of the work in this area considers the problem for centralized database systems, where materialized views are used for purposes such as speeding up queries on views or implementing rule checking efficiently. The problem has additional facets when considered in the scope of data integration systems. However, the previous work still applies to some cases and forms the basis of algorithms for data integration applications. We believe the following categorization of incremental view maintenance algorithms clarifies the link between the two cases:

• pre-update algorithms: maintenance is performed before the base relations have actually been updated, as in the immediate maintenance policy, where maintenance is performed within the transaction that is updating the source.


• post-update algorithms: maintenance is performed after the transaction that updates the relevant base relations is over.

Previous methods that apply to centralized databases naturally involve pre-update algorithms, because the base relations and the materialized views are parts of the same system. Data integration systems, however, have to use post-update algorithms, since the underlying sources are autonomous and unaware of the maintenance procedures occurring in the integrated system; we cannot force them to include maintenance procedures within their update transactions. As stated in [ZGMHW95], information sources are decoupled from the data warehouse, which brings additional consistency problems.

In this subsection, our focus is on methods devised for incremental view maintenance in general. We present techniques specific to data integration systems in the next subsection.

[GM95] provides a survey of incremental view maintenance algorithms, classifying them according to the view language and available information dimensions. Here we discuss some of them without giving an explicit classification; our aim is to give a flavor of the important issues addressed in many of these algorithms.

[BLT86] handles Select (S), Project (P) and Join (J) views in isolation first and then considers them together as SPJ views. For each case, both insertions and deletions are considered. For S views, inserted tuples are simply unioned with, and deleted tuples simply subtracted from, the materialized view data set. Updating P views when a deletion occurs in the base relation is more complicated. The problem stems from the fact that a tuple in the view, projected on some particular attribute, may be there due to multiple tuples in the base relations. If one of these base tuples is deleted, the derived tuple may not have to be deleted, since there are other base tuples it is derived from. This problem is solved by keeping a counter for each view tuple; a view tuple has to be deleted only when its counter drops to 0. Upon insertion of new tuples into one of the join relations, J views need only join the newly added tuples with the other join relation rather than recomputing the join of the two relations from scratch. Deletions are handled in a similar way, by joining only the deleted tuples and then subtracting the result from the original view. These methods are then combined for SPJ views [BLT86].
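
The counter technique for projection views can be sketched as follows: each view tuple stores the number of base tuples that derive it, insertions increment the counter, deletions decrement it, and the tuple is removed only when the counter reaches zero. The class below is an illustrative reading of this idea, not the data structures of [BLT86].

from collections import Counter
from typing import Hashable

class CountedProjectionView:
    """Projection view maintained with per-tuple derivation counts,
    in the spirit of the counter method of [BLT86]."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_insert(self, projected_tuple: Hashable) -> None:
        self.counts[projected_tuple] += 1          # one more derivation

    def on_delete(self, projected_tuple: Hashable) -> None:
        self.counts[projected_tuple] -= 1          # one derivation gone
        if self.counts[projected_tuple] <= 0:
            del self.counts[projected_tuple]       # no derivations left

    def tuples(self):
        return set(self.counts)

# Two base tuples project to the same value; deleting one of them
# must not remove the view tuple.
view = CountedProjectionView()
view.on_insert(("Mazda",))      # derived from base tuple 1
view.on_insert(("Mazda",))      # derived from base tuple 2
view.on_delete(("Mazda",))
print(view.tuples())            # {('Mazda',)}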

[GMS93] presents two algorithms: the Counting algorithm and the DRed (Deletion and Re-derivation) algorithm. In both, the emphasis is on deletion, since it is the more problematic case. The Counting algorithm is proposed for non-recursive views with negation and aggregate functions. It is based on the counter method of [BLT86], but the view language is more general. For each tuple in the materialized view, the number of alternative derivations is stored as its count. Relevant insertions increment the count and relevant deletions decrement it by 1. When the count drops to 0, the tuple need not be stored in the materialized view any more. This algorithm also works with recursive views, but only if every tuple has a finite number of derivations. The DRed algorithm works for general recursive views with negation and aggregation. It involves three basic steps: (i) ignore the alternative derivations and put a view tuple into the delete set if at least one of its derivations is invalidated, (ii) remove tuples from the delete set if they have remaining derivations, (iii) compute the tuples to be inserted into the views due to insertions into the base relations.

In addition to the algorithmic approaches summarized above, there are algebraic approaches to incremental view maintenance. [GL95] presents an approach based on multiset (bag) semantics, in which all the arguments rest on the equivalence of bag-valued expressions. Bag algebra expressions are used to represent the materialized views. Given a transaction that changes the state of the database and a set of bag expressions, they derive delta expressions that describe how the bag algebra expressions need to be updated; the goal is to find a minimal set of such delta expressions. [GL95] also emphasizes that proper handling of duplicates is important for computing aggregate functions (such as averaging a list of values) correctly.
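
As a rough illustration of the bag view, Python's collections.Counter can stand in for a multiset: a transaction is summarized by an insert bag and a delete bag, and applying both deltas preserves duplicate multiplicities, which aggregates such as averages rely on. The representation below is ours, not the delta expressions of [GL95].

from collections import Counter

def apply_deltas(view_bag: Counter, inserts: Counter, deletes: Counter) -> Counter:
    """Apply bag-valued deltas to a materialized bag: additive union with
    the inserted multiset, truncated difference (monus) with the deleted one."""
    updated = view_bag + inserts      # bag union keeps multiplicities
    updated = updated - deletes       # Counter subtraction drops non-positive counts
    return updated

prices = Counter({2500: 2, 15000: 1})            # materialized bag of prices
new_prices = apply_deltas(prices,
                          inserts=Counter({2500: 1}),
                          deletes=Counter({15000: 1}))
print(new_prices)                                 # Counter({2500: 3})
avg = sum(p * n for p, n in new_prices.items()) / sum(new_prices.values())
print(avg)                                        # 2500.0 -- duplicates counted correctly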

[RSS96] explores what additional views should be materialized for optimal incremental maintenance of a given materialized view. [MRRS00] generalizes this idea to maintaining a set of views efficiently by using additional temporarily or persistently materialized views. The approach involves materializing common subexpressions among the view maintenance expressions, as in multi-query optimization algorithms. We mentioned this approach earlier in this section when presenting the selection of views to materialize.

6.4 View Maintenance in Data Integration Systems

The main problem for incremental view maintenance in data integration systems is that maintenance has to be done after the updates at the data sources have occurred. When the maintenance eventually takes place, the integrated system may need to ask additional queries to the data sources if it does not have all the information needed to perform it. If the data sources continue to change between the update known to the integrated system and the maintenance time, then the additional queries will be answered according to the new state of the data sources, which differs from the state at the time of the initial update (i.e., the update currently being reflected at the integrated system). This is called a state bug [CGL+96] or a view maintenance anomaly [ZGMHW95]. The problem stems from the fact that pre-update maintenance algorithms cannot be used for data integration systems as they are. [CGL+96] proposes two ways of avoiding the state bug:

• using the pre-update algorithms but restricting the updates and views so that correctness is guaranteed

• developing specific algorithms for the post-update case

[CGL+96] proposes new algorithms for the post-update case that are based on database invariants, i.e., conditions guaranteed to hold in every state of the database, which are used to maintain correctness. As in [GL95], an algebraic approach based on bag semantics is taken. [CGL+96] also emphasizes minimizing the view down-time: views usually become inaccessible to queries during maintenance. [QW97] addresses this problem with a two-version no locking (2VNL) algorithm, in which two concurrent versions of the materialized views provide continuous and consistent access to the warehouse during maintenance.

[ZGMHW95], on the other hand, builds on a pre-update view maintenance algorithm: the algorithm of [BLT86], briefly summarized in the preceding subsection, is used as its basis. [ZGMHW95] proposes ECA (the Eager Compensating Algorithm), in which extra compensating queries are used to eliminate anomalies. Anomalies would in fact not occur if we recomputed the views or stored copies of the base relations referenced in the views, but both options are too costly compared to incremental view maintenance. The basic idea of ECA is to send compensating queries to the data sources to cancel the potential anomalies that could otherwise arise from the query answers coming back from the sources. In other words, the warehouse eagerly forces the data sources to send correct information. This is done by anticipating beforehand what kinds of anomalies can occur and preparing view maintenance queries that contain compensating expressions, in addition to the ordinary view maintenance expressions, so that the anomalies are avoided.
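
The toy simulation below reproduces the anomaly itself rather than ECA: the warehouse maintains a materialized join and answers each update notification by querying the current source state, so a join tuple ends up counted twice. All relation names and data are invented; ECA would add compensating queries precisely to cancel this kind of double counting.

from collections import Counter

source_R, source_S = [], []          # source relations (joined on position 0)
view = Counter()                     # warehouse: materialized bag of join results

def join(left, right):
    return [(l, r) for l in left for r in right if l[0] == r[0]]

def warehouse_handle_insert(relation_name, tuple_):
    """Naive maintenance: join the inserted tuple with the *current*
    state of the other relation at the source."""
    if relation_name == "R":
        delta = join([tuple_], source_S)
    else:
        delta = join(source_R, [tuple_])
    view.update(delta)

# Source transaction 1 inserts into R and notifies the warehouse ...
source_R.append((1, "a"))
# ... but before the warehouse reacts, transaction 2 inserts into S.
source_S.append((1, "b"))

warehouse_handle_insert("R", (1, "a"))   # sees the *new* S -> one join tuple
warehouse_handle_insert("S", (1, "b"))   # sees R as well   -> counted again

print(view)   # Counter({((1, 'a'), (1, 'b')): 2})  -- the correct view has it once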

The next subsections discuss how the incremental maintenance process can be made more efficient.

6.5 Update Filtering

Not all updates at the data sources cause updates to the materialized views. We can speed up the maintenance process if we can detect which base data updates have no effect on the views and hence need not be propagated. Such updates are called irrelevant updates, and the procedure of pruning irrelevant updates from the maintenance plan is called update filtering. [BCL89] calls queries/views that are not affected by the updates queries independent of updates.

Most of the work in this area aims at theoretically defining necessary and sufficient conditions for the detection of irrelevant updates for the cases of insertions, deletions and modifications [BLT86, BCL89, LS93]. [BCL89] defines irrelevant updates as update operations applied to a base relation that have no effect on the state of a derived relation, independently of the database state. [LS93] reduces the update independence problem to the equivalence problem for Datalog programs and provides decidability results for different cases. [BLT86] presents more practical algorithms for detecting irrelevant updates. The views considered are PSJ queries, and the selection condition is the primary determinant of relevance. For insertions into the base relations, we substitute the values of the inserted tuple into the selection condition of the view. If the selection condition becomes unsatisfiable, then the insertion is irrelevant to the view, i.e., no tuple needs to be inserted into the view; otherwise, the insertion may be relevant to the view. Similarly, for deletions, we substitute the values of the deleted tuple into the selection condition of the view. If the selection condition becomes unsatisfiable, then the deletion is irrelevant to the view, i.e., no tuple needs to be deleted from the view. In general, satisfiability of boolean expressions is NP-complete, but [BLT86] assumes boolean expressions that are conjunctions of inequalities, for which the problem can be solved in polynomial time. It can also be generalized to disjunctions of conjunctions, which adds a linear factor to the complexity.
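
When the inserted or deleted tuple binds every attribute that occurs in the condition, unsatisfiability reduces to the condition simply evaluating to false, which the small filter below exploits. The predicate representation and attribute names are illustrative assumptions.

import operator
from typing import Dict, List, Tuple

OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge,
       ">": operator.gt, "=": operator.eq}

# A selection condition as a conjunction of (attribute, op, constant) atoms.
Condition = List[Tuple[str, str, float]]

def insertion_is_relevant(condition: Condition, inserted: Dict[str, float]) -> bool:
    """Substitute the inserted tuple into the condition; if the condition
    evaluates to false, the insertion is irrelevant and can be filtered out."""
    return all(OPS[op](inserted[attr], const) for attr, op, const in condition)

cheap_cars_cond: Condition = [("price", "<=", 3000)]
print(insertion_is_relevant(cheap_cars_cond, {"price": 2500}))    # True: relevant
print(insertion_is_relevant(cheap_cars_cond, {"price": 15000}))   # False: filtered out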

In conventional database systems, update filtering can be implemented using integrity constraints or triggers: the base relations are not decoupled from the derived relations, and the view definitions are known to the whole system. In data integration systems, however, the filtering has to be done at the integration system level; the data sources cannot perform filtering since they are not aware of the view definitions.

6.6 View Self-Maintenance

Another way to speed up maintenance is to minimize external data source access. As mentioned earlier, to maintain a view we may need to ask queries of the data sources in addition to receiving the update information itself, which requires communication with the sources. We should exploit the information available at the integrated system (the data warehouse) as much as we can to avoid this communication.

In general, self-maintenance refers to views being maintained without using all the base data. There exist different notions of its exact meaning, depending on how much information is available. In the ideal case, the view update is performed locally at the integrated system knowing only the particular base data update that has occurred, the view definitions, and the materialized data. Whenever this is not possible, we need additional techniques to minimize base data access.

The first thing to do is to decide whether a given view is self-maintainable. If it is, then we need to know how to achieve self-maintenance; otherwise, techniques may be developed to make it self-maintainable. Self-maintainability can be investigated both for a single view and for multiple views; initially, we consider each view in isolation. It should also be noted that self-maintainability is an issue specific to data integration systems: in traditional (centralized) databases, since all information is known to the system, the question does not arise.

[GJM96] aims at defining self-maintenance rules for SPJ views. Self-maintainability algorithms are highly dependent on the view definition language. Three issues are investigated: (i) which relation is modified, (ii) what type of modification occurs, and (iii) whether key information can be exploited. The results are as follows (a small sketch encoding them appears after the list):

• For insertions, SP views are self-maintainable. SPJ views are self-maintainable only if the join is a self-join (i.e., a relation R is joined with itself) and the join attribute is the key of R; other SPJ views are not self-maintainable.

• For deletions, SPJ views are self-maintainable.


• For updates, if modeled as a deletion followed by an insertion, the rules for insertions and deletions apply. Otherwise, SPJ views are self-maintainable if the updates are on non-exposed attributes (i.e., attributes not involved in any predicate of the view definition).
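
The sketch announced above folds these rules into a small decision procedure. It is only our literal encoding of the listed rules under simplifying assumptions (a view described by whether it joins, whether the join is a self-join on a key, and which attributes are exposed); it is not the algorithm of [GJM96].

from dataclasses import dataclass

@dataclass
class SPJView:
    has_join: bool                          # SPJ view with a join (vs. SP only)?
    self_join_on_key: bool = False          # join is R with itself on R's key
    exposed_attrs: frozenset = frozenset()  # attributes used in view predicates

def self_maintainable(view: SPJView, modification: str,
                      updated_attrs: frozenset = frozenset()) -> bool:
    """Rough encoding of the [GJM96] rules listed above."""
    if modification == "insert":
        if not view.has_join:
            return True                       # SP views
        return view.self_join_on_key          # only this SPJ special case
    if modification == "delete":
        return True                           # SPJ views are self-maintainable
    if modification == "update":
        # treated directly (not as delete+insert): safe if only
        # non-exposed attributes change
        return not (updated_attrs & view.exposed_attrs)
    raise ValueError("unknown modification type")

v = SPJView(has_join=True, self_join_on_key=False,
            exposed_attrs=frozenset({"price"}))
print(self_maintainable(v, "insert"))                          # False
print(self_maintainable(v, "delete"))                          # True
print(self_maintainable(v, "update", frozenset({"model"})))    # True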

[BCL89] explores similar conditions for a more general view definition language. [Huy97] investigates the meaning of self-maintainability in different contexts and shows that self-maintainability can be reduced to the problem of deciding query containment.

There are several techniques to make views locally maintainable [Huy97]:

• Multiple-View Self-Maintenance

  Views that are not self-maintainable in isolation may become collectively maintainable at the integrated system when they are considered together [Huy97]. In other words, the information available to each view is extended from its own definition and materialization to all the materialized views at the warehouse.

• Batch Updates

  Rather than maintaining each update operation separately, we can save the updates and maintain them all together, which may reduce the amount of work. For example, if an update operation deletes a tuple and a following update inserts the same tuple back, then the two updates, considered as a whole, have no effect on the state of the materialized views (see the sketch after this list).

• Auxiliary Materialized Views

  By materializing additional views, other views may become self-maintainable. The basic idea is to increase the amount of information available at the integrated system level.
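
The sketch referred to under Batch Updates computes the net effect of an update log before any maintenance is triggered: an insertion cancelled by a later deletion of the same tuple generates no work at all. Bag counts handle duplicates; the log format is an assumption of ours.

from collections import Counter
from typing import Iterable, Tuple

Update = Tuple[str, tuple]   # ("insert" | "delete", tuple)

def net_effect(log: Iterable[Update]) -> Tuple[Counter, Counter]:
    """Collapse a log of updates into net insert/delete bags; updates that
    cancel each other out generate no maintenance work at all."""
    balance = Counter()
    for action, tup in log:
        balance[tup] += 1 if action == "insert" else -1
    inserts = Counter({t: n for t, n in balance.items() if n > 0})
    deletes = Counter({t: -n for t, n in balance.items() if n < 0})
    return inserts, deletes

log = [("insert", (471, 2500)),
       ("delete", (471, 2500)),     # cancels the insert above
       ("insert", (839, 15000))]
print(net_effect(log))   # (Counter({(839, 15000): 1}), Counter())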

Lastly, one important point is that self-maintenance also eliminates the situations in which anomalies can occur, because the maintenance is performed entirely locally at the integrated system. Anomalies are caused by additional queries sent to the data sources some time after the related update has occurred; if a view is self-maintainable, no such additional querying is necessary.

6.7 Dynamic View Management

As stated earlier, materialized view management has two important components: view selection and view maintenance. Until now we have assumed that the views to materialize are selected once, at the beginning, according to statistics on frequently asked queries and base data update frequencies, and that the selection is then fixed; from that point on, the system concentrates on maintaining the selected views. This kind of view management is called static view management. Its major problem is that if the query workload or the base data update patterns change, the view selection decisions become invalid.

The solution proposed in [KR99] is dynamic view management, in which the view selection and view maintenance stages are unified. The system continuously monitors the query workload and updates the view selection decisions dynamically. The constraints to be considered, in addition to the changing workload patterns, include disk space and the maintenance window. The maintenance window matters even more than space, because the system is usually unavailable for queries while maintenance is being carried out, so this time window has to be kept as short as possible. The more views are materialized, the longer the maintenance window is; however, more materialization speeds up query processing. Therefore, a compromise has to be made.
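
A very small sketch of this trade-off is given below: whenever fresh workload statistics arrive, the set of materialized views is re-chosen greedily against a space budget and a maintenance-window budget. This only illustrates the constraints involved; it is not the DynaMat algorithm of [KR99], and all cost fields are assumed inputs.

from dataclasses import dataclass
from typing import List

@dataclass
class CandidateView:
    name: str
    size: float               # disk space it would occupy
    maintenance_time: float   # contribution to the maintenance window
    benefit: float            # estimated query-time saving for the current workload

def reselect_views(candidates: List[CandidateView],
                   space_budget: float,
                   window_budget: float) -> List[str]:
    """Greedy re-selection run whenever workload statistics are refreshed:
    admit views by benefit per unit of consumed resources until either the
    space or the maintenance-window budget is exhausted."""
    chosen, space, window = [], 0.0, 0.0
    ranked = sorted(candidates,
                    key=lambda v: v.benefit / (v.size + v.maintenance_time),
                    reverse=True)
    for v in ranked:
        if space + v.size <= space_budget and window + v.maintenance_time <= window_budget:
            chosen.append(v.name)
            space += v.size
            window += v.maintenance_time
    return chosen

# Hypothetical workload snapshot: materialization is re-decided on the fly.
snapshot = [CandidateView("daily_sales", 40, 5, 300),
            CandidateView("full_join", 120, 60, 320),
            CandidateView("region_cube", 60, 20, 150)]
print(reselect_views(snapshot, space_budget=150, window_budget=30))
# ['daily_sales', 'region_cube']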


7 Concluding Remarks

Data integration services are important ingredients of network data services. They offer transparent access to many different data sources residing on the network, as if these data sources were in fact a single data source; in other words, they hide from the user the diversity of the underlying sources.

In this chapter, we have presented the major problems in building and operating data integration systems, together with a survey of proposed solutions. Most of the current data integration methods are still under research, and most of the systems mentioned throughout this chapter are research prototypes. The proposed techniques still need to be applied to real-life systems and their performance improved.

One important point we have not discussed in this chapter is the ongoing standardization work on representing and exchanging data on the web more easily. The best known of these standards is the eXtensible Markup Language (XML) [XML00]. As XML and similar standardization efforts become widely used, data integration services will also benefit: the heterogeneity in data representations should decrease, making more effective data integration solutions possible.

The amount and availability of data on the network are growing at an increasing speed, and the need to bring these data together to infer useful information is rising with them. As long as there is data, there will also be a need for data integration services.

Bibliography

[AHK96] Y. Arens, C. Hsu, and C. A. Knoblock. Query Processing in the SIMS Information Mediator. In A. Tate, editor, Advanced Planning Technology, pages 61–69. AAAI Press, Menlo Park, CA, 1996.

[AK97] N. Ashish and C. Knoblock. Semi-automatic Wrapper Generation for Internet Information Sources. In Second IFCIS International Conference on Cooperative Information Systems (CoopIS), Charleston, SC, 1997.

[AKS96] Y. Arens, C. A. Knoblock, and W. Shen. Query Reformulation for Dynamic Information Integration. Journal of Intelligent Information Systems (JIIS) - Special Issue on Intelligent Information Integration, 6(2/3):99–130, 1996.

[ALU01] F. N. Afrati, C. Li, and J. D. Ullman. Generating Efficient Plans for Queries Using Views. In ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.

[Ash00] N. Ashish. Optimizing Information Mediators By Selectively Materializing Data. PhD thesis, USC, March 2000.

[BCL89] J. A. Blakeley, N. Coburn, and P. Larson. Updating Derived Relations: Detecting Irrelevant and Autonomously Computable Updates. Transactions on Database Systems (TODS), 14(3):369–400, September 1989.

[BF94] M. Bonjour and G. Falquet. Concept Bases: A Support to Information Systems Integration. In Proceedings of the CAiSE'94 Conference, Utrecht, 1994.

[BKLW99] S. Busse, R. Kutsche, U. Leser, and H. Weber. Federated Information Systems: Concepts, Terminology and Architectures. Technical Report 99-9, Berlin Technical University, 1999.

[BLT86] J. A. Blakeley, P. Larson, and F. Wm. Tompa. Efficiently Updating Materialized Views. In ACM SIGMOD International Conference on Management of Data, pages 61–71, Washington, DC, May 1986.

[Bor95] A. Borgida. Description Logics in Data Management. IEEE Transactions on Knowledge and Data Engineering, 7(5):671–682, October 1995.

[CD97] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1):65–74, 1997.

[CGL+96] L. S. Colby, T. Griffin, L. Libkin, I. S. Mumick, and H. Trickey. Algorithms for Deferred View Maintenance. In ACM SIGMOD International Conference on Management of Data, pages 469–480, Montreal, Canada, June 1996.

[CGL+98] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Description Logic Framework for Information Integration. In Principles of Knowledge Representation and Reasoning, pages 2–13, 1998.

[CGMH+94] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In 10th Meeting of the Information Processing Society of Japan (IPSJ), pages 7–18, Tokyo, Japan, October 1994.

[CKL+97] L. S. Colby, A. Kawaguchi, D. F. Lieuwen, I. S. Mumick, and K. A. Ross. Supporting Multiple View Maintenance Policies. In ACM SIGMOD International Conference on Management of Data, pages 405–416, Tucson, AZ, June 1997.

[CLN99] D. Calvanese, M. Lenzerini, and D. Nardi. Unifying Class-Based Representation Formalisms. Journal of Artificial Intelligence Research, 11:199–240, 1999.

[Eik99] L. Eikvil. Information Extraction from World Wide Web: A Survey, July 1999.

[EJ95] L. Ekenberg and P. Johannesson. Conflictfreeness as a Basis for Schema Integration. In Conference on Information Systems and Management of Data, pages 1–13, 1995.

[FKL97] D. Florescu, D. Koller, and A. Levy. Using Probabilistic Information in Data Integration. In International Conference on Very Large Data Bases (VLDB), pages 216–225, Athens, Greece, August 1997.

[Fri99] M. T. Friedman. Representation and Optimization for Data Integration. PhD thesis, University of Washington, 1999.


[GJM96] A. Gupta, H. V. Jagadish, and I. S. Mumick. Data Integration using Self-Maintainable Views. In International Conference on Extending Database Technology (EDBT), pages 140–144, Avignon, France, March 1996.

[GL95] T. Griffin and L. Libkin. Incremental Maintenance of Views with Duplicates. In ACM SIGMOD International Conference on Management of Data, pages 328–339, San Jose, CA, June 1995.

[GM95] A. Gupta and I. S. Mumick. Materialized Views: Problems, Techniques, and Applications. Data Engineering Bulletin, 18(2):3–18, June 1995.

[GM99] A. Gupta and I. S. Mumick, editors. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.

[GMS93] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining Views Incrementally. In ACM SIGMOD International Conference on Management of Data, pages 157–166, Washington, DC, May 1993.

[GMUW00] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation, chapter 11: Information Integration. Prentice Hall, 2000.

[Gup97] H. Gupta. Selection of Views to Materialize in a Data Warehouse. In International Conference on Database Theory (ICDT), pages 98–112, Delphi, Greece, January 1997.

[Hal95] G. Hall. Negotiation in Database Schema Integration. In The Inaugural AIS Americas Conference on Information Systems, Pittsburgh, PA, August 1995.

[Has00] W. Hasselbring. Information System Integration. Communications of the ACM, 43(6):33–38, 2000.

[HG92] R. Herzig and M. Gogolla. Transforming Conceptual Data Models into an Object Model. In International Conference on Conceptual Modeling / the Entity Relationship Approach, pages 280–298, 1992.

[HGMN+97] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos. Template-based Wrappers in the TSIMMIS System. In Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.

[HGMW+95] J. Hammer, H. Garcia-Molina, J. Widom, W. Labio, and Y. Zhuge. The Stanford Data Warehousing Project. IEEE Bulletin of the Technical Committee on Data Engineering, 18(2):41–48, 1995.


[HM85] D. Heimbigner and D. McLeod. A Federated Architecture for Information Management. ACM Transactions on Office Information Systems, 3(3):253–278, 1985.

[HRU96] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing Data Cubes Efficiently. In ACM SIGMOD International Conference on Management of Data, pages 205–216, Montreal, Canada, June 1996.

[HSC+97] M. N. Huhns, M. P. Singh, P. E. Cannata, N. Jacobs, T. Ksiezyk, K. Ong, A. P. Sheth, C. Tomlinson, and D. Woelk. The Carnot Heterogeneous Database Project: Implemented Applications. Distributed and Parallel Databases Journal, 5(2):207–225, 1997.

[Huy97] N. Huyn. Multiple-View Self-Maintenance in Data Warehousing Environments. In International Conference on Very Large Data Bases (VLDB), pages 26–35, Athens, Greece, August 1997.

[HZ96] R. Hull and G. Zhou. A Framework for Supporting Data Integration Using the Materialized and Virtual Approaches. In ACM SIGMOD International Conference on Management of Data, pages 481–492, 1996.

[IFF+99] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An Adaptive Query Execution System for Data Integration. In ACM SIGMOD International Conference on Management of Data, pages 299–310, Philadelphia, PA, June 1999.

[JLYV00] M. Jarke, M. Lenzerini, Y. Vassiliou, and P. Vassiliadis. Fundamentals of Data Warehouses. Springer Verlag, 2000.

[JPSL+88] G. Jacobsen, G. Piatetsky-Shapiro, C. Lafond, M. Rajinikanth, and J. Hernandez. CALIDA: A Knowledge-Based System for Integrating Multiple Heterogeneous Databases. In Third International Conference on Data and Knowledge Bases, pages 3–18, Jerusalem, Israel, 1988.

[KLSS95] T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava. The Information Manifold. In AAAI Symposium on Information Gathering in Distributed Heterogeneous Environments, 1995.

[KR99] Y. Kotidis and N. Roussopoulos. DynaMat: A Dynamic View Management System for Data Warehouses. In ACM SIGMOD International Conference on Management of Data, pages 371–382, Philadelphia, PA, June 1999.

[KWD97] N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, 1997.


[Lev99a] A. Y. Levy. Combining Artificial Intelligence and Databases for Data Integration. In Special Issue of LNAI: Artificial Intelligence Today; Recent Trends and Developments. Springer Verlag, 1999.

[Lev99b] A. Y. Levy. Logic-Based Techniques in Data Integration. In J. Minker, editor, Workshop on Logic-Based Artificial Intelligence, Washington, DC, June 1999.

[Lev00] A. Y. Levy. Answering Queries Using Views: A Survey. Submitted for publication, 2000.

[Lit85] W. Litwin. An Overview of the Multidatabase System MRSDM. In ACM National Conference, pages 495–504, October 1985.

[LRO96] A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In International Conference on Very Large Data Bases (VLDB), pages 251–262, Bombay, India, September 1996.

[LS93] A. Y. Levy and Y. Sagiv. Queries Independent of Updates. In International Conference on Very Large Data Bases (VLDB), pages 171–181, Dublin, Ireland, August 1993.

[LSS93] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. On the Logical Foundation of Schema Integration and Evolution in Heterogeneous Database Systems. In 2nd International Conference on Deductive and Object-Oriented Databases, pages 81–100, Phoenix, AZ, 1993.

[Mit99] P. Mitra. Algorithms for Answering Queries Efficiently Using Views. Technical report, Infolab, Stanford University, September 1999.

[MMK98] I. Muslea, S. Minton, and C. Knoblock. Wrapper Induction for Semistructured Web-based Information Sources. In Conference on Automatic Learning and Discovery (CONALD-98), 1998.

[MRRS00] H. Mistry, P. Roy, K. Ramamritham, and S. Sudarshan. Materialized View Selection and Maintenance using Multi-Query Optimization. Submitted for publication, March 2000.

[MW88] N. E. Malagardis and T. J. Williams, editors. Standards in Information Technology and Industrial Control, chapter Multidatabase Systems in ISO/OSI Environment, pages 83–97. North-Holland, Netherlands, 1988.

[ND95] S. Navathe and M. Donahoo. Towards Intelligent Integration of Heterogeneous Information Sources. In Proceedings of the 6th International Workshop on Database Re-engineering and Interoperability, 1995.


[OV99] M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems, chapter 4: Distributed DBMS Architecture. Prentice Hall, 1999.

[PGMA96] Y. Papakonstantinou, H. Garcia-Molina, and S. Abiteboul. Object Fusion in Mediator Systems. In International Conference on Very Large Data Bases (VLDB), Bombay, India, September 1996.

[PL00] R. Pottinger and A. Levy. A Scalable Algorithm for Answering Queries Using Views. In International Conference on Very Large Data Bases (VLDB), pages 484–495, Cairo, Egypt, September 2000.

[QW97] D. Quass and J. Widom. On-Line Warehouse View Maintenance. In ACM SIGMOD International Conference on Management of Data, pages 393–404, Tucson, AZ, June 1997.

[Rea89] M. Rusinkiewicz et al. OMNIBASE: Design and Implementation of a Multidatabase System. In 1st Annual Symposium in Parallel and Distributed Processing, Dallas, Texas, May 1989.

[RSS96] K. A. Ross, D. Srivastava, and S. Sudarshan. Materialized View Maintenance and Integrity Constraint Checking: Trading Space for Time. In ACM SIGMOD International Conference on Management of Data, pages 447–458, Montreal, Canada, June 1996.

[SDN98] A. Shukla, P. M. Deshpande, and J. F. Naughton. Materialized View Selection for Multidimensional Datasets. In International Conference on Very Large Data Bases (VLDB), pages 488–499, New York City, NY, August 1998.

[SG90] T. K. Sellis and S. Ghosh. On the Multiple-Query Optimization Problem. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2(2):262–266, June 1990.

[SL90] A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases. ACM Computing Surveys, 22(3):183–236, 1990.

[TBC+87] T. Templeton, D. Brill, A. Chen, S. Dao, E. Lund, R. Macgregor, and P. Ward. Mermaid: A Front-End to Distributed Heterogeneous Databases. In International Conference on Data Engineering, pages 695–708, 1987.

[Ull97] J. D. Ullman. Information Integration using Logical Views. In International Conference on Database Theory (ICDT), pages 19–40, Delphi, Greece, January 1997.

[Var99] A. Vargun. Semantic Aspects of Heterogeneous Databases, 1999.


[VP98] V. Vassalos and Y. Papakonstantinou. Using Knowledge of Redundancy for Query Optimization in Mediators. In Workshop on AI and Information Integration (in conjunction with AAAI'98), Madison, WI, July 1998.

[XML00] Extensible Markup Language (XML) 1.0. W3C Recommendation, October 2000. http://www.w3.org/TR/REC-xml.

[YKL97] J. Yang, K. Karlapalem, and Q. Li. Algorithms for Materialized View Design in Data Warehousing Environment. In International Conference on Very Large Data Bases (VLDB), pages 136–145, Athens, Greece, August 1997.

[ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View Maintenance in a Warehousing Environment. In ACM SIGMOD International Conference on Management of Data, pages 316–327, San Jose, CA, June 1995.

