ABSTRACT
The developed system manages data in peer databases to answer queries that span
several databases within its purview. Data integration tools can be used for the same
purpose, but they suffer from two problems: they require a comprehensive schema design
before they can be used (an overhead), and they are difficult to extend because extensions
typically break backward compatibility. The developed system uses coordination rules
(mappings) and the Query Reformulation Algorithm to manage data in the peers and provide
the required data to the end user. The algorithm and coordination rules also make the
system feasible and easily extensible.
TABLE OF CONTENTS
Abstract
Table of contents
List of Figures
1. Background and Rationale
1.1. Introduction
1.2. Existing Methods
1.2.1. File based P2P Systems
1.2.2. Mediator based Integration - GAV and LAV
1.2.3. Peer to Peer Integration - Introducing PeerDB, Hyperion and Piazza
1.3. Advantages of the New System
2. Narrative
2.1. A Simple Peer to Peer System
2.2. Peer to Peer Data Placement Problem
2.3. Data Placement Design Choices
2.3.1. Scope of Decision Making
2.3.2. Extent of Knowledge Sharing
2.3.3. Heterogeneity of Information Sources
2.3.4. Dynamicity of Participants
2.4. How Piazza Works
2.4.1. Query Optimization Exploiting Commonalities and Available Data
2.4.2. Propagating Information about Materialized Views
2.4.3. Consolidating Query Evaluation and Data Placement
2.4.4. Schema Mediation in Piazza
2.4.5. Schema Mediation in Piazza
2.5. JXTA
2.5.1. JXTA Jorgan
3. System Design
3.1. System Requirements
3.2. Piazza Algorithm
3.3. P2P Database and Coordination Rules
3.4. Workflow
4. Evaluation and Results
4.1. Evaluation
4.2. Results
5. Future Work
6. Conclusion
7. Bibliography and References
8. Appendix
LIST OF FIGURES
Figure 2.1 P2P Architecture: Logical view
Figure 2.2 Piazza System Architecture
Figure 3.2.1 Flow chart of the Query Reformulation Algorithm
Figure 3.2.2 Rule-Goal Tree
Figure 3.3.1 Topology
Figure 3.4.1 Status Window
Figure 3.4.2 Interface for Each Node (batch file)
Figure 3.4.3 Status during Coordination Rules Announcement
Figure 3.4.4 Peers are Ready
Figure 3.4.5 Execution of Query 1
Figure 3.4.5 Execution of Query 2
Table 4.2.1 Results
1. BACKGROUND AND RATIONALE
1.1 Introduction
Users today can access a multitude of related data sources and combine the
returned data into useful information that is not physically stored in any single
place. For instance, a person who intends to buy a car can query several car dealer
Web sites and then compare the results. He can further query a data source that
provides car reviews to help decide among the cars he liked. As another example,
imagine a company that has several branches in different cities. Each branch has its
own local database recording its sales. Whenever global decisions about the company
have to be made, each branch database must be queried and the results must be
combined. However, contacting data sources individually and then combining the
results manually every time information is needed is a very tedious task.
Instead, a service is needed which provides transparent access to a collection of
related data sources as if these sources as a whole constituted a single data source. Such a
service is called a data integration service and the system that integrates multiple sources
to provide this service is usually referred to as a data integration system. The main
contribution of a data integration system is that users can focus on specifying what data
they want rather than on describing how to obtain it. A data integration system relieves
the user from the burden of finding the relevant data sources, interacting with each of
them separately, then combining the data they return. To achieve this, the system
provides an integrated view of the data stored in the underlying data sources. Users can
uniformly access all the data sources as if they were querying a single data source.
Similarly, environmental, hydrographic, meteorological and oceanographic data have
been collected and made available by numerous local, state and federal agencies as well
as by universities. Currently, users have to interact manually with these large collections
of Internet data sources, determine which ones to access and how to access them, and
merge the results from different data sources by hand. This is a tedious and cumbersome
process, so a data integration system is required to answer such queries. Some
examples of areas in which integration is especially useful are: science and culture
(integrating genomic data, monitoring events in the sky, the Puget Sound Regional
Synthesis Model); enterprise data integration; and the World Wide Web (XML integration,
comparison shopping, etc.).
The medical system in India is not well organized when it comes to rendering
services to people who live below the poverty line (BPL). Most people living
in villages are deprived of advanced medical technology, mainly because the required
help is not delivered promptly. Villagers have to travel from their location to
cities to get a medical checkup or blood tests, which costs money, time and effort. If
the government could provide a fair transportation system, data integration approaches
would make it possible to cater to the dire needs of these people immediately. For
instance, if an emergency station is set up in the city and its number is made known to all,
any serious incident in any village can be reported to it, and the service can take care of
the patient. All the required data, such as data from hospitals, clinical laboratories for
blood tests and fire stations, can be made accessible at the emergency station, so that
staff can decide which hospital the patient should be taken to. If a fire breaks out at any
location, the emergency station service can pick up the victims and place them in nearby
hospitals, besides sending firemen to control the fire, simply by looking at the data
available at the station from different sources. This project has been developed to address
this problem and to ease it at least to a certain extent.
1.2 Existing Methods
A long-standing tenet of distributed systems is that the strength of a distributed
system can grow as more hosts participate in it. Each participant may contribute data and
computing resources (such as unused CPU cycles and storage) to the overall system, and
the wealth of the community can scale with the number of participants. A peer-to-peer
(P2P) distributed system is one in which participants rely on one another for service, rather
than solely relying on dedicated and often centralized infrastructure. Instead of strictly
decomposing the system into clients (which consume services) and servers (which provide
them), peers in the system can elect to provide services as well as consume them. The
membership of a P2P system is relatively unpredictable: service is provided by the peers
that happen to be participating at any given time [Rabinovich 1998]. At first glance, many
of the challenges in designing P2P systems seem to fall clearly under the banner of the
distributed systems community. However, upon closer examination, the fundamental
problem in most P2P systems is the placement and retrieval of data. Indeed, current P2P
systems focus strictly on handling semantics-free, large-granularity requests for objects by
identifier (typically a name), which both limits their utility and restricts the techniques that
might be employed to distribute the data. Most of the integration techniques currently in use
can be grouped into three categories: content or file based integration, where
peers communicate through file sharing; mediator based integration, where
global and local schemas are defined in terms of one another to achieve data
communication; and P2P integration using state-of-the-art query reformulation algorithms,
as in Piazza [Halevy 2003] and PeerDB [Ives 2000], which provides data coordination among
peers without compromising peer autonomy.
1.2.1 File Based P2P Systems
Many examples of P2P systems have emerged recently, most of which are wide-
area, large-scale systems that provide content sharing [Napster 2001], storage services
[Kubiatowicz 2000], or distributed “grid” computation [Legion 2000]. Smaller-scale P2P
systems also exist, such as federated, serverless file systems and collaborative workgroup
tools. The success of these systems has been mixed; some, such as Napster, have enjoyed
enormous popularity and perform well at scale.
Others, including Gnutella, have failed to attract a large community, possibly due
to a combination of weak application semantics and technical flaws that limit their scaling.
Perhaps the most exciting possibility of peer-to-peer computing is that the desirable
properties of the system can become amplified as new peers join: because of its
decentralization, the system’s robustness, availability, and performance might grow with
the number of peers. A more subtle possibility is that the richness and diversity of the
system can similarly scale, since new peers can introduce specialized data or resources that
the system was previously lacking. Decentralization also helps eliminate proprietary
interests in the system’s infrastructure; instead of trust being placed in dedicated servers,
trust is diffused over all participants in the system. The need for administration is
diminished, since there is no dedicated infrastructure to manage. By routing requests
through many peers and replicating content, the system might be able to hide the identity of
content publishers and consumers, making it resilient against censorship.
Although the vision of P2P systems is grand, the technical challenges associated
with them are immense, and as a result the realization of the vision has been elusive.
Because the membership in the system is ad-hoc and dynamic, it is very difficult to predict
or reason about the location and quality of the system’s resources. For example, the
placement of data in content-sharing systems is often naive: data placement is largely
demand driven, with little regard given to network bandwidth, load, or historical
trustworthiness of the peer on which the data is placed. Because the system is
decentralized, any optimizations such as data placement must be done in a completely
distributed manner; the system cannot necessarily presume the existence of a single oracle
that coordinates the activity of all of the systems’ peers [Siong 2003]. Furthermore, the
dynamic nature of the system may impose fundamental limitations on its data consistency
and availability: if the rate at which data changes in the system is high, then the overhead
of maintaining globally accessible indexes may become prohibitive as the number of peers
in the system grows. Because P2P system designers have to a large extent failed to
overcome these challenges, the semantics provided by these systems are typically quite
weak. In most content sharing systems, only popular content is readily accessible, yet
content popularity seems to follow heavy-tailed distributions, in which a large fraction of
requests are directed to unpopular content. Similarly, current content sharing systems ignore
problems such as updates to content, and they typically only support retrieval of objects by
name. These current content sharing systems are largely limited to applications in which
objects are large, opaque, and atomic, and whose content is well-described by their name;
for instance, today’s P2P systems would be highly ineffective at content-based retrieval of
text files or at fetching only the abstracts from a set of LaTeX documents. Moreover, they
are limited to caching, pre-fetching, or pushing of content at the object level, and know
nothing of overlap between objects. These limitations arise because the P2P world is
lacking in the areas of semantics, data transformation, and data relationships, yet these are
some of the core strengths of the data management community.
Queries, views, and integrity constraints can be used to express relationships
between existing objects and to define new objects in terms of old ones. Complex queries
can be posed across multiple sources, and the results of one query can be materialized and
used to answer other queries. Data management techniques such as these can be used to
develop better solutions to the data placement problem at the heart of any P2P system
design: data must be placed in strategic locations and then used to improve query
performance. The database field will benefit from the results, as new query processing
systems can leverage the increased scalability, reliability, and performance of a successful
P2P architecture [Doan 2002].
1.2.2 Mediator Based Integration - GAV and LAV
In recent years, there has been research into developing tools that facilitate the
rapid integration of heterogeneous information sources that may include both structured
and unstructured data. A common problem facing many organizations today is that of
multiple, disparate object stores, knowledge bases, file systems, digital libraries,
information retrieval systems, and electronic mail systems. Decision makers often need
information from multiple sources, but are unable to get and use the required information in
a timely fashion due to the difficulties of accessing the different systems, and due to the
fact that the information obtained can be inconsistent and contradictory. There are basically
two approaches to designing a data integration system. In the global-as-view approach,
one defines the concepts in the global schema as views over the sources, whereas in the
local-as-view approach, one characterizes the sources as views over the global schema.
The recent trend in data integration has been to loosen the coupling between data.
Here the idea is to provide a uniform query interface over a mediated schema. A query over
this schema is then transformed into specialized queries over the original databases. This
process can also be called view-based query answering, because each of the data sources
can be considered a view over the (nonexistent) mediated schema. Formally, such an approach
is called Local As View (LAV), where "Local" refers to the local sources/databases. An
alternative model of integration is one where the mediated schema is designed to be a view
over the sources. This approach, called Global As View (GAV), where "Global" refers to the
global (mediated) schema, is often used because of the simplicity of answering
queries issued over the mediated schema. However, the obvious drawback is the need to
rewrite the view for the mediated schema whenever a new source is to be integrated and/or an
existing source changes its schema.
Data integration systems are formally defined as a triple (G, S, M) where G is the
global (or mediated) schema, S is the set of heterogeneous source schemas, and M is the
mapping that maps queries between the source and the global schemas. Both G and S are
expressed in languages over alphabets comprised of symbols for each of their respective
relations. The mapping M consists of assertions between queries over G and queries over S.
In GAV, the global schema is modeled as a set of views over S. In this case M associates to
each element of G a query over S. Query processing becomes a straightforward operation
because the associations between G and S are well-defined. The burden of complexity is
placed on implementing mediator code instructing the data integration system exactly how
to retrieve elements from the source databases. If any new sources are added to the system,
considerable effort may be necessary to update the mediator, thus the GAV approach
should be favored in cases where the sources are not likely to change. In LAV, the source
database is modeled as a set of views over G. In this case M associates to each element of S
a query over G. Here the exact associations between G and S are no longer well-defined.
As is illustrated in the next section, the burden of determining how to retrieve elements
from the sources is placed on the query processor. The benefit of LAV modeling is that
new sources can be added with far less work than in a GAV system; thus the LAV
approach should be favored in cases where the mediated schema is not likely to change.
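The difference between the two mapping directions can be made concrete with a small
sketch. The relation names (Movie, s1_films, s2_films, s3_films) and the Python
representation are illustrative assumptions, not part of any system described here. Under
GAV, each element of G is bound to a query over S, so answering a query means unfolding
that definition. Under LAV, each source is described in terms of G, and the query processor
must work out which sources to read.

```python
# Hypothetical source data (S): two independent sources holding (title, year) rows.
sources = {
    "s1_films": [("Alien", 1979), ("Heat", 1995)],
    "s2_films": [("Heat", 1995), ("Ran", 1985)],
}

# GAV: each element of the global schema G is defined as a query over S.
gav_mapping = {
    "Movie": lambda s: sorted(set(s["s1_films"]) | set(s["s2_films"])),
}


def answer_gav(global_relation, srcs):
    """Answer a query over G by unfolding its fixed definition over the sources."""
    return gav_mapping[global_relation](srcs)


# LAV: each element of S is described as a view over G.
# In this simplified sketch a description only names the global relation it populates.
lav_mapping = {
    "s1_films": "Movie",
    "s2_films": "Movie",
}


def relevant_sources(global_relation):
    """Find the sources whose descriptions mention the queried global relation."""
    return sorted(name for name, rel in lav_mapping.items() if rel == global_relation)


def answer_lav(global_relation, srcs):
    """The query processor decides which sources to read and merges their rows."""
    rows = set()
    for name in relevant_sources(global_relation):
        rows.update(srcs[name])
    return sorted(rows)


# Adding a source under LAV only adds one description; gav_mapping is untouched
# and would have to be rewritten by hand to see the new source.
sources["s3_films"] = [("Up", 2009)]
lav_mapping["s3_films"] = "Movie"
```

In this sketch the new source becomes visible to answer_lav immediately, while answer_gav
keeps returning the old result until the view for Movie is rewritten, which mirrors the
trade-off described above.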
Modeling Web sites often requires the expressive power of both GAV and LAV. Hence
GLAV was developed, a language for source description that is more expressive than
GAV and LAV combined. Query answering for GLAV sources is no harder than it is for
LAV sources. GLAV reaches the limits of the expressive power of a data source
description language. GLAV is also of interest for data integration independent of data
webs, because of the flexibility it provides in integrating diverse sources.
1.2.3 Peer to Peer Integration - Introducing PeerDB, Hyperion and Piazza
In current data sharing P2P systems, only file-system-like capabilities are
provided, while the semantics of the data is largely ignored. For example, in Gnutella,
queries are restricted to strings that can be contained in a filename and directory path;
that is, only simple value searches on file names are supported. A peer-based data
management system can be seen as a distributed and heterogeneous database system, but the
scale of the system and its dynamism as nodes join and leave the network pose several major
challenges [Napster 2001]. First, there is no predefined global schema. With nodes
joining and leaving the network at any time, assuming a global schema in such a dynamic
environment is not practical, scalable or extensible. One possible approach is
to perform “mapping” on the fly during querying. Second, realizing efficient query
processing becomes more difficult. Initial response time is expected to be high, as relevant
data have to be identified before any optimization and query processing can be
performed. Third, much information redundancy exists in the network, which inevitably
brings about data and computation redundancy. Unfortunately, information redundancy
cannot be avoided unless some control over data placement is exercised. Finally, the notions
of correctness and completeness of query results cannot be used in their pure sense as
in traditional database systems.
PeerDB is a P2P-based system for distributed data sharing. PeerDB has several
distinguishing features. First, each participating node is a full-fledged object management
system that supports content-based search. Second, in PeerDB, users can share data
without a shared global schema. Third, PeerDB adopts mobile agents to assist in query
processing. Since agents can perform operations at the peers’ sites, the network
bandwidth is better utilized. More importantly, agents can be coded to perform a wide
variety of tasks, making it easy to extend the capabilities of a PeerDB node [Ives 2000].
Another architecture for peer database management systems (PDBMSs), one
that instantiates the vision of logical P2P data coordination, is
Hyperion. A PDBMS is a conventional DBMS augmented with a P2P interoperability
layer. This layer implements the functionality required for peers to share and coordinate
data without compromising their own autonomy. The P2P layer allows a PDBMS to
establish or abolish an acquaintance (semi-)automatically at runtime, thereby inducing a
logical peer-to-peer network. The two important aspects of this system are data
coordination, in which each source behaves as an access point for both local and shared
data, and data sharing both within and across domains, while views and GLAV
(global-and-local-as-view) mappings have been used to integrate and exchange data within
a common domain.
The other peer system, which is predominantly based on ontologies and has not been
completely implemented, is Piazza. Piazza paves the way for a fruitful combination of
data management and knowledge representation techniques in the construction of the
semantic web [Halevy 2003]. In fact, the techniques offered in Piazza are not a
replacement for rich ontologies and languages for mapping between ontologies; rather, they
provide the missing link between data described using rich ontologies and the wealth of
data that is currently managed by a variety of tools. In order to exploit data from other
sites, there must be semantic glue between the sites, in the form of semantic mappings.
Mappings in Piazza are specified between small numbers of sites, usually pairs. In this
way, it is possible to support the two rather different methods for semantic mediation:
mediated mapping, where data sources are related through a mediated schema or
ontology, and point-to-point mappings, where data is described by how it can be
translated to conform to the schema of another site.
1.3 Advantages of the New System
The ultimate goal of Piazza is to provide query answering and translation across the
full range of data. Logically, a Piazza system consists of a network of different sites (also
referred to as peers or nodes), each of which contributes resources to the overall system.
The resources contributed by a site include one or more of the following: (1) ground or
extensional data, (2) models of data. In addition, nodes may supply computed data, i.e.,
cached answers to queries posed over other nodes. When a new site (with data instance or
schema) is added to the system, it is semantically related to some portion of the existing
network. Queries in Piazza are always posed from the perspective of a given site's schema,
which defines the preferred terminology of the user. When a query is posed, Piazza
provides answers that utilize all semantically related data within the system [Halevy 2003].
In order to exploit data from other sites, there must be coordination rules between
the sites, in the form of mappings. Mappings in Piazza are specified between small
numbers of sites, usually pairs. In this way, it is possible to support the two rather different
methods for schema mediation mentioned earlier: mediated mapping, where data sources
are related through a mediated schema or ontology, and point-to-point mappings, where
data is described by how it can be translated to conform to the schema of another site.
Admittedly, from a formal perspective, there is little difference between these two kinds of
mappings, but in practice, content providers may have strong preferences for one or the
other.
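Because mappings are specified only between small groups of sites, the data that is
semantically related to a query is whatever can be reached by following mappings outward
from the querying peer. The sketch below illustrates that idea with a breadth-first
traversal. The peer names and the mappings list are hypothetical, treating every mapping as
usable in both directions is a simplifying assumption, and the traversal only finds which
peers are reachable; it is not Piazza's actual reformulation algorithm.

```python
from collections import deque

# Hypothetical pairwise mappings: each pair means "a mapping relates these two schemas".
mappings = [
    ("clinic", "hospital"),
    ("hospital", "lab"),
    ("fire_station", "dispatch"),
]


def build_graph(pairs):
    """Turn the list of pairwise mappings into an undirected adjacency map."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def related_peers(start, pairs):
    """Return every peer reachable from `start` through a chain of mappings."""
    graph = build_graph(pairs)
    seen = {start}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbour in sorted(graph.get(peer, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return sorted(seen)
```

For example, a query posed at the hypothetical clinic peer can draw on hospital and lab,
even though no direct mapping between clinic and lab was ever written.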
2. NARRATIVE
The goal of peer-to-peer data integration using semantic rules is to address this
need: a decentralized, easily extensible data management architecture in which
any user can contribute new data, schema information, or even mappings between other
peers’ schemas.
2.1 A Simple Peer to Peer System
A peer-to-peer (P2P) computer network relies on the diverse connectivity between
participants in a network and on their cumulative bandwidth, rather than on
conventional centralized resources where a relatively small number of servers provide the
core value to a service or application. P2P networks are typically used for connecting nodes
via largely ad hoc connections. Such networks are useful for many purposes. Sharing
content files containing audio, video, data or anything in digital format is very common,
and real-time data, such as telephony traffic, is also passed using P2P technology. The
concept of P2P is increasingly being extended to describe the relational dynamic
active in distributed networks, i.e. not just computer to computer, but human to human.
Yochai Benkler has coined the term "commons-based peer production" to denote
collaborative projects such as free software. Associated with peer production are the
concepts of peer governance (referring to the manner in which peer production projects are
managed) and peer property [Franconi 2003]. A logical view of P2P architecture is shown
in Figure 2.1.
Figure 2.1 P2P architecture: logical view.
2.2 Peer to Peer Data Placement Problem
The data placement problem for a P2P system is as follows. Assume we are given
a set of cooperating nodes connected by a network (typically, but not necessarily, the
Internet) that has limited bandwidth on each link. Nodes know about and exchange data
with a collection of participating peers, and they may serve any or all of four roles [Suciu
2003]. The first of these is a data origin, which provides original content to the system and
is the authoritative source of that data. As a storage provider, a peer stores materialized
views (consuming disk resources, and perhaps replacing previously materialized views if
there is insufficient space), and as a query evaluator, it uses a portion of its CPU resources
to evaluate the set of queries forming its workload. As query initiators, peers act as clients
in the system and pose new queries. (A node may initiate new queries on behalf of a query
it is attempting to evaluate.) The overall cost of answering a query includes the transfer
cost from the storage provider or data origin to the query evaluator, the cost of resources
utilized at the query evaluator and other nodes, and the cost to transfer the results to the
query initiator. The data placement problem is to distribute data and work so the full query
workload is answered with lowest cost under the existing resource and bandwidth
constraints. While a cursory glance at the data placement problem suggests many
similarities with multi-query optimization in a distributed database, there are substantial
and fundamental differences. For example, in the general case, a P2P system has no
centralized schema and no central administration.
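The cost components listed above can be written down as a small model. The sketch below is
only an illustration under stated assumptions: cost is measured in abstract time units,
transfer cost is data size divided by link bandwidth, and the function names, parameters
and figures are hypothetical rather than taken from any system described here.

```python
def query_cost(input_size, result_size, bw_in, bw_out, cpu_cost):
    """Total cost of answering one query at a given evaluator.

    input_size / bw_in    : transfer from storage provider or data origin to evaluator
    cpu_cost              : resources used at the evaluator
    result_size / bw_out  : transfer of the results to the query initiator
    """
    transfer_in = input_size / bw_in
    transfer_out = result_size / bw_out
    return transfer_in + cpu_cost + transfer_out


def cheapest_evaluator(candidates, input_size, result_size):
    """Pick the evaluator with the lowest total cost.

    candidates maps a node name to a (bw_in, bw_out, cpu_cost) tuple.
    """
    return min(
        candidates,
        key=lambda name: query_cost(input_size, result_size, *candidates[name]),
    )


# Hypothetical figures: node A has a slow inbound link, node B a slow outbound link.
candidates = {"A": (10.0, 5.0, 2.0), "B": (50.0, 1.0, 1.0)}
best = cheapest_evaluator(candidates, input_size=100, result_size=10)
```

With these figures node A costs 14 units and node B costs 13, so B is chosen even though
its link to the initiator is the slowest one. The data placement problem asks for this kind
of choice across the whole workload at once, not one query at a time.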
2.3 Data Placement Design Choices
While a globally optimal data placement is conceptually simple to define
for an ideal environment, in practice any P2P system will have certain limitations. These
limitations are due to factors such as constrained bandwidth and resources, message
propagation delays, and so on. Some important dimensions that affect the data placement
problem include:
2.3.1 Scope of Decision-Making
A major factor is the scale at which query processing and view materialization
decisions are made. At one extreme, all queries in the entire system are optimized together,
using complete knowledge of the available materialized views, resources, and network
bandwidth constraints — this poses all of the challenges of multi-query optimization plus a
number of additional difficulties. In particular, work must be distributed globally across
many peers, and decisions must be made about when and where to materialize results for
future use. At the other end of the spectrum, every decision is made on a single-node,
single-query basis — this is the familiar problem of query optimization for distributed data.
Clearly, a good query optimization and data placement strategy will be much more
beneficial to the global system than the local one; yet decisions are likely to be much more
expensive to make on the global scale, so any real system will likely be forced to work
within a smaller scope.
2.3.2 Extent of Knowledge Sharing
Related to the above problem is the question of how much knowledge is available
to the system during its query optimization process. In particular, the first step in choosing
a query evaluation strategy is likely to be identifying which nodes have materialized views
that can speed query processing. A simple technique would be to use a centralized catalog
of all available views and their locations, analogous to the central directory used by
Napster.
2.3.3 Heterogeneity of Information Sources
Data may originate at a few authoritative sources, or alternatively, every
participant might be allowed (or expected) to contribute data to the community. The level
of heterogeneity of the data influences the degree to which a system can ensure uniform,
global semantics for the data. A P2P system might impose a single schema on all
participants to enforce uniform, global semantics, but for some applications this will be too
restrictive. Alternatively, a limited number of data sources and schemas may be allowed, so
traditional schema and data integration techniques will likely apply (with the restriction
that there is no central authority). The case of fully heterogeneous data makes global
semantic integration extremely challenging.
2.3.4 Dynamicity of Participants
Some P2P systems, such as [Napster 2001], assume a fixed set of nodes in the
system. However, one of the greatest potential strengths of P2P systems arises when they
eschew reliance on dedicated infrastructure and allow peers to leave the system at will.
Even under these conditions, participants typically have broadly varying availability
characteristics. Some peers are akin to servers: their membership in the system stays
largely static. Others have much more dynamic membership, joining and leaving the
system at will. In a configuration where original data is distributed uniformly across the
network, including on nodes that frequently disappear, it may become impossible to
reliably access certain items. At the other extreme, if all data is placed or cached only on
the set of static “servers,” the system will have greatly reduced flexibility and performance
(this configuration is equivalent to yesterday’s web, prior to proxy caches and content
distributors such as Akamai). An intermediate approach places all original content on the
consistently available nodes to provide availability, but replicates or caches data at the
dynamic peers.
2.4 How Piazza Works
The Piazza algorithm focuses on the dynamic data placement problem described
above, with the goal of remaining scalable even with large numbers of nodes and moderately
frequent updates. Figure 2.2 shows the data origin as an entity distinct from the peers in the system
(though a peer can actually serve both roles) — Piazza can only guarantee availability of
data while its origin is a member of the network, and only the origin may update its data.
All peer nodes belong to spheres of cooperation, in which they pool their resources and
make cooperative decisions. Each sphere of cooperation may in turn be nested within a
successively larger sphere, with which it cooperates to a lesser extent. These spheres of
cooperation will often mirror particular administrative boundaries (e.g. those within a
corporation or local ISP), and in many ways resemble a cooperative cache. Given this
configuration, Piazza focuses on the following aspects of the data placement problem:
2.4.1 Query Optimization Exploiting Commonalities and Available Data
At the heart of our problem lies a variation of traditional multi-query
optimization. Ideally, the Piazza system will take the current query workload, find
commonalities among the queries, exploit materialized views whenever cost-effective,
distribute work under resource and bandwidth constraints, and determine whether certain
results should be materialized for future use (while considering the likelihood of updates to
the data). For scalability reasons, these decisions are taken at the level of a sphere of
cooperation rather than on a global basis. In order to perform this optimization, Piazza must
address two important sub-problems [Halevy 2003][Suciu 2003].
2.4.2 Propagating Information about Materialized Views
When a query is posed, the first step is to consider whether it can be answered
using the data at “nearby” storage providers, and to evaluate the costs of doing so. This
requires the query initiator to be aware of existing materialized views and properties such
as location and data freshness. One direction we are exploring is to propagate information
about materialized views using techniques derived from routing protocols [Tanenbaum
1996]. In particular, a node advertises its materialized views to its neighbors. Each node
consolidates the advertisements it receives and propagates them to its neighbors. Under
constrained resources, any node can arbitrarily drop advertisements without jeopardizing
system correctness— a query can always be posed in terms of the data origins. This routing
protocol avoids the scalability problems caused by broadcasting every view materialization
and those caused by broadcasting every query request.
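The advertisement scheme described above can be illustrated with a minimal sketch. It assumes a distance-vector style exchange in which each node keeps, per view, the hop count to the nearest known copy; the class and method names (ViewAdvertisementSketch, Node, receive, propagate) and the capacity limit are hypothetical and are not taken from Piazza.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of routing-style propagation of view advertisements. */
final class ViewAdvertisementSketch {

    /** One peer node; all names here are illustrative, not Piazza's API. */
    static final class Node {
        final String name;
        final int capacity; // maximum number of advertisements this node accepts
        final List<Node> neighbors = new ArrayList<>();
        // view name -> hop count to the nearest known materialized copy
        final Map<String, Integer> known = new LinkedHashMap<>();

        Node(String name, int capacity) {
            this.name = name;
            this.capacity = capacity;
        }

        /** Record a view materialized at this node (distance 0). */
        void materialize(String view) {
            known.put(view, 0);
        }

        /** Merge one advertisement; returns true if the local table changed. */
        boolean receive(String view, int hops) {
            Integer current = known.get(view);
            if (current != null && current <= hops) {
                return false; // an equal or closer copy is already known
            }
            if (current == null && known.size() >= capacity) {
                return false; // constrained node silently drops the advertisement
            }
            known.put(view, hops);
            return true;
        }
    }

    static void link(Node a, Node b) {
        a.neighbors.add(b);
        b.neighbors.add(a);
    }

    /** Exchange advertisements between neighbors until no table changes. */
    static int propagate(List<Node> nodes) {
        int rounds = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Node n : nodes) {
                // Work on a copy so the table is never modified while it is iterated.
                Map<String, Integer> snapshot = new LinkedHashMap<>(n.known);
                for (Node peer : n.neighbors) {
                    for (Map.Entry<String, Integer> ad : snapshot.entrySet()) {
                        if (peer.receive(ad.getKey(), ad.getValue() + 1)) {
                            changed = true;
                        }
                    }
                }
            }
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        Node a = new Node("A", 8);
        Node b = new Node("B", 8);
        Node c = new Node("C", 8);
        link(a, b);
        link(b, c);
        a.materialize("hospitals_by_area");
        propagate(Arrays.asList(a, b, c));
        System.out.println(c.name + " knows " + c.known);
    }
}
```

A node that drops an advertisement simply never learns about that view; queries posed there can still be answered from the data origins, which is why dropping does not affect correctness.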
2.4.3 Consolidating Query Evaluation and Data Placement
A node may pose a query that cannot be evaluated with the data available from
known peers. In this case, the data must be retrieved directly from the data origins.
However, at any given point, there may be many similar un-evaluable queries within the
same sphere of cooperation, and the sphere should choose an optimal strategy for
evaluating all of them. Therefore, all un-evaluable queries are broadcast within the cluster;
the cluster identifies commonalities among this query set, then assigns roles (evaluation of
a query or sub query and/or materialization of results) to specific nodes based on cost.
Figure 2.2 Piazza System Architecture [Doan 2002].
Data origins serve original content; peer nodes (A-E) cooperate to answer queries
but have limited disk and CPU resources. Nodes are connected by bandwidth-constrained
links and advertise their materialized views. Nodes belong to spheres of cooperation with
which they share resources; these spheres may be nested within successively larger spheres
(see Figure 2.2).
2.4.4 Schema Mediation in Piazza
In contrast to a data integration environment, which has a tree-based hierarchy
with data source schemas at the leaf nodes and one or more mediated schemas as
intermediate nodes, a peer data management system (PDMS) can support an arbitrary
graph of interconnected schemas. Some of these schemas are defined virtually for
purposes of querying and mapping. These are called peer schemas, and generally their
relations (peer relations) will have an open-world assumption (i.e., the data returned by
querying these relations may be incomplete). Queries in the PDMS will be posed over the
relations from a specific peer schema. A peer schema represents the peer’s “view of the
world” that is unlikely to be the same at different peers. Peers may also contribute data to
the system in the form of stored relations. Stored relations are analogous to data sources
in a data integration system: all queries in a PDMS will be reformulated strictly in terms
of stored relations that may be stored locally or at other peers [Suciu 2003].
There are two types of schema mappings in Piazza. A mapping that relates two or
more peer schemas is called a peer description, whereas a mapping that relates a stored
schema to a peer schema is called a storage description. Peer descriptions define the
correspondences between the “views of the world” at different peers. Storage
descriptions, on the other hand, map the data stored at a peer into the peer’s view of the
world. Thus, storage descriptions are similar to data source descriptions in a data
integration system.
Two main formalisms have been proposed for schema mediation in data
integration systems. In the first, called global-as-view (GAV), the relations in the
mediated schema are defined as views over the relations in the sources. In the second,
called local-as-view (LAV), the relations in the sources are specified as views over the
mediated schema. For example, let us assume there are two data sources: two car dealer
databases which both became part of the Acme Cars company. Each of the car dealers has a
separate schema for storing information about cars. Dealer 1 stores it in the relation:
Cars(vin, make, model, color, price)
Dealer 2 stores information about his cars for sale in the relation:
CarsForSale(vehicleID, carMake, carModel, carColor, carPrice).
Acme Cars uses a mediated architecture to integrate these two dealers' databases.
It does this by providing a mediated schema of the two schemas above. The mediated
schema consists of just one relation:
Automobiles(vin, autoMake, autoModel, autoColor, autoPrice).
In the GAV approach, for each relation R in the mediated schema, a view in terms of the
source relations is written which specifies how to obtain R's tuples from the sources.
The following simple example shows how the mediated schema relations CAR and REVIEW
can be obtained from the source relations S1, S2 and S3.
S1(vin, status, model, year) => CAR(vin, status)
S2(vin, status, make, price) => CAR(vin, status)
S1(vin, status, model, year) ∧ S3(vin, review) => REVIEW(vin, review)
S2(vin, status, make, price) ∧ S3(vin, review) => REVIEW(vin, review)
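The GAV rules above can be read as a lookup table from mediated relations to source expressions, which makes query unfolding mechanical. The following is a minimal sketch under a strong simplifying assumption: atoms are represented by relation name only, so variable unification is ignored. The class GavUnfoldingSketch and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: unfolding a query over GAV views (relation names only). */
final class GavUnfoldingSketch {
    // mediated relation -> alternative definitions; each definition is a conjunction of sources
    private final Map<String, List<List<String>>> views = new LinkedHashMap<>();

    /** Register one GAV rule: the conjunction of sources yields tuples of the mediated relation. */
    void define(String mediated, String... sources) {
        views.computeIfAbsent(mediated, k -> new ArrayList<>()).add(Arrays.asList(sources));
    }

    /** Unfold a conjunction of mediated relations into a union of source conjunctions. */
    List<List<String>> unfold(List<String> mediatedRelations) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start with one empty conjunction
        for (String relation : mediatedRelations) {
            List<List<String>> bodies = views.getOrDefault(relation, Collections.emptyList());
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : result) {
                for (List<String> body : bodies) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.addAll(body);
                    next.add(extended);
                }
            }
            result = next; // an undefined relation leaves no rewriting at all
        }
        return result;
    }

    public static void main(String[] args) {
        GavUnfoldingSketch gav = new GavUnfoldingSketch();
        gav.define("CAR", "S1");
        gav.define("CAR", "S2");
        gav.define("REVIEW", "S1", "S3");
        gav.define("REVIEW", "S2", "S3");
        System.out.println(gav.unfold(Arrays.asList("CAR", "REVIEW")));
    }
}
```

Each inner list is one conjunctive query over the sources; the outer list is their union. A real implementation would also have to rename and unify variables such as vin across the atoms.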
In the LAV approach, we take the opposite approach to GAV: for each data source S, a view
in terms of the mediated schema relations is written that describes which tuples of the
mediated schema relations are found in S. Assume that source S1 contains cars produced
from 1990 onward and source S2 contains cars sold by the dealer "ACME".
S1(vin, status, model, year) :- CAR(vin, status), MODEL(vin, model, year), year ≥ 1990
S2(vin, status, make, price) :- CAR(vin, status), MODEL(vin, make, year),
SELLS(dealer name, vin, price), dealer name = "ACME"
S3(vin, review) :- REVIEW(vin, review)
Query processing using the LAV approach is an application of a much broader problem
called "Answering Queries using Views" [Franconi 2003].
Piazza combines and generalizes the two data integration formalisms, and it
extends them to the XML world in a way that keeps evaluation tractable. Two kinds of
peer descriptions are supported: equality and inclusion descriptions. Peer descriptions
have the following form: Q1(P1) = Q2(P2) (or Q1(P1) ⊆ Q2(P2) for inclusions),
where Q1 and Q2 are conjunctive queries with the same arity and P1 and P2 are sets of
peers. Intuitively, the mapping statement specifies a semantic mapping by stating that
evaluating Q1 over the peers P1 will always produce the same answer (or a subset in the
case of inclusions) as evaluating Q2 over P2. The set of mappings of a PDMS defines its
semantic network (or topology). Optimizing the topology of a PDMS is an interesting
research problem. Some of the possible optimization criteria include: eliminating
redundant mappings, reducing the diameter of a PDMS (to reduce information loss in
query reformulation), and identifying semantically unreachable peers.
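One of the criteria above, identifying semantically unreachable peers, reduces to graph reachability once the mappings are viewed as edges. The sketch below assumes, purely for illustration, that each mapping is a directed edge meaning "a query posed at the first peer can be reformulated onto the second"; the class MappingTopologySketch and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: finding peers that no chain of mappings can reach. */
final class MappingTopologySketch {
    // peer -> peers that its mappings lead to (assumed direction: from -> to)
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    void addPeer(String peer) {
        edges.computeIfAbsent(peer, k -> new LinkedHashSet<>());
    }

    void addMapping(String from, String to) {
        addPeer(from);
        addPeer(to);
        edges.get(from).add(to);
    }

    /** Breadth-first search from the query peer; everything not visited is unreachable. */
    Set<String> unreachableFrom(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        if (edges.containsKey(start)) {
            seen.add(start);
            queue.add(start);
        }
        while (!queue.isEmpty()) {
            String peer = queue.remove();
            for (String next : edges.get(peer)) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        Set<String> unreachable = new LinkedHashSet<>(edges.keySet());
        unreachable.removeAll(seen);
        return unreachable;
    }

    public static void main(String[] args) {
        MappingTopologySketch topology = new MappingTopologySketch();
        topology.addMapping("hospital", "lab");
        topology.addMapping("lab", "firestation");
        topology.addMapping("clinic", "hospital");
        System.out.println(topology.unreachableFrom("hospital"));
    }
}
```

The same traversal, extended to record path lengths, would give the diameter mentioned above.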
2.4.5 Querying in Piazza
Query reformulation is perhaps the single most important aspect of query
processing in a PDMS, since it is crucial for a PDMS’s ability to answer user queries. The
input of the reformulation algorithm is a set of peer mappings and storage descriptions and
a query Q. The output of the algorithm is a query expression Q’ that refers to stored
relations only. To answer Q we need to evaluate Q’ over the stored relations.
When all peer mappings are definitional (similar to GAV mappings in data integration),
the algorithm proceeds by constructing a simple rule-goal tree: goal nodes are
labeled with atoms of the peer relations, and rule nodes are labeled with peer mappings. It
begins by expanding each query subgoal according to the relevant peer mappings in the
PDMS. When none of the leaves of the tree can be expanded any further, it uses the
storage descriptions for the final step of reformulation in terms of the stored relations.
Suppose instead that all peer mappings in the PDMS are of the form V ⊆ Q(P). In this case
(which is similar to LAV mappings in data integration), we begin with the query subgoals and
apply an algorithm for answering queries using views. The algorithm is applied to the
result until it cannot proceed further, and as in the previous case, it uses the storage
descriptions for the last step of reformulation [Halevy 2003].
A major challenge of the reformulation algorithm is to combine and interleave the
two types of reformulation techniques. One type of reformulation (unfolding) replaces a
subgoal with a set of subgoals, while the other (rewriting) replaces a set of subgoals
with a single subgoal. As a result, the output of the algorithm can be an acyclic graph
rather than a tree.
2.5 JXTA
JXTA is a set of open, generalized peer-to-peer protocols that allows any connected
device on the network (from cell phones and PDAs to PCs and servers) to communicate and
collaborate in a P2P manner. It is an open source product. JXTA technology enables
developers to create innovative distributed services and applications that enable
people to:
• Collaborate on projects from anywhere using any connected device
• Share compute services, such as processor cycles or storage systems, regardless of
where the systems or the users are physically located
• Communicate with colleagues across the world using a peer-to-peer network
• Share files and information to distributed locations on the network, not just to local
hard drives
• Connect game systems so that multiple people in multiple locations can play the
same game interactively.
Some uses are obvious, such as messaging and resource sharing, but as
deployments have shown, collaboration, content delivery, and decentralization are also ripe
for P2P applications. JXTA technology provides developers with tools to build network
applications that thrive in highly dynamic environments. "There's not so much one classic,
killer application," Soto says. "But there are killer characteristics that make an application
suitable for JXTA technology." [Berners 2000]. These characteristics include situations:
• where centralization is not required or not possible
• where resilience is needed--in case a piece of the network is lopped off, for example
• where massive scalability is important--peers could pick up large pieces of the load
on the network--the more peers in the network the more valuable the P2P solution is
• where relationships are transient or ad hoc
• where resources are highly distributed
The JXTA Release 2.0 adds new features that enhance scalability and
performance. All are aimed at making JXTA technology more and more enterprise ready.
"JXTA greatly reduces the complexity required to build and deploy P2P solutions and
services," says Soto. "Businesses benefit greatly as a result: improved collaboration and
sharing, greater security and resilience because there's no single point of failure, up-to-the-
second data currency, and better control. And that means lower costs and faster time to
market for improved competitiveness." JXTA strives to provide a base P2P infrastructure
over which other P2P applications can be built. This base consists of a set of protocols that
are language independent, platform independent, and network agnostic (that is, they do not
assume anything about the underlying network). These protocols address the bare
necessities for building generic P2P applications. Designed to be simple with low
overhead, the protocols target, to quote the JXTA vision statement, "every device with a
digital heartbeat." [Bolosky 2000]. JXTA currently defines six protocols, but not all JXTA
peers are required to implement all six of them. The number of protocols that a peer
implements depends on that peer's capabilities; conceivably, a peer could use just one
protocol. Peers can also extend or replace any protocol, depending on their particular
requirements. It is important to note that JXTA protocols by themselves do not promise
interoperability. Here, you can draw parallels between JXTA and TCP/IP. Though both
FTP and HTTP are built over TCP/IP, you cannot use an FTP client to access Web pages.
The same is the case with JXTA. Just because two applications are built on top of JXTA
doesn't mean that they can magically interoperate. Developers must design applications to
be interoperable. However, developers can use JXTA, which provides an interoperable
base layer, to further reduce interoperability concerns.
2.5.1 The JXTA Jargon
Before proceeding any further, let's quickly look at the various concepts in JXTA
[Bolosky 2000].
Peers
A peer is any entity on the network implementing one or more JXTA protocols. A peer
could be anything from a mainframe to a mobile phone or even just a motion sensor. A
peer exists independently and communicates with other peers asynchronously.
Peer groups
Peers with common interests can aggregate and form peer groups. Peer groups
can span multiple physical network domains.
Messages
All communication in the JXTA network is achieved by sending and receiving
messages. These messages, called JXTA messages, adhere to a standard format, which is
key to interoperability.
Pipes
Pipes establish virtual communication channels in the JXTA environment. Peers
use them for sending and receiving JXTA messages. Pipes are deemed virtual because
peers don't need to know their actual network addresses to use them. That is an important
abstraction.
Services
Both peers and peer groups can offer services. A service offered by a peer
individually, at a personal level, is called a peer service, a concept equivalent to
centralization. No other peer needs to offer that service; if the peer is not active, the service
might become unavailable.
Peer groups offer services called peer group services. Unlike peer services, these
services are not specific to a single peer but available from multiple peers in the group.
Peer group services are more readily available, because even if one peer is unavailable,
other peers offer the same services.
Advertisements
An advertisement is used to publish and discover any JXTA resource such as a peer, a
peer group, a pipe, or a codat. Advertisements are represented as XML documents.
Identifiers
Identifiers play a key role in the JXTA environment. Identifiers specify
resources, not physical network addresses. The JXTA identifier is defined as a URN
(Uniform Resource Name). A URN is nothing but a URI (Uniform Resource Identifier)
that has to remain globally unique and persistent even when the resource ceases to exist.
Endpoints
Endpoints are destinations on the network and can be represented by a network
address. Peers don't generally use endpoints directly; they use them indirectly through
pipes, which are built on top of endpoints.
Routers
Anything that moves packets around the JXTA network is called a JXTA router.
Not all peers need to be routers. Peers that are not routers must find a router to route their
messages.
The JXTA protocols
The key to JXTA lies in a set of common protocols defined by the JXTA
community. These protocols can be used as a foundation to build applications. Designed
with a low overhead, the protocols assume nothing about the underlying network topology
over which an application that uses them is built [Berners 2000].
Peer Discovery Protocol (PDP)
Peers use this protocol to discover all published JXTA resources. Since
advertisements represent published resources, PDP essentially helps a peer discover an
advertisement on other peers. As the lowest-level discovery protocol, PDP provides a basic
mechanism for discovery; applications might choose to build higher-level discovery
mechanisms on top of it.
Peer Resolver Protocol (PRP)
Often in the network, peers send queries to other peers to locate some service or
content. The Peer Resolver Protocol intends to standardize these queries' formats. With this
protocol, peers can send generic queries and receive responses.
Peer Information Protocol (PIP)
PIP can be used to ping a peer in the JXTA environment. A peer receiving a ping
message has several options: It can give a simple acknowledgment, consisting only of its
uptime. It can send a full response, which includes its advertisement. Or it can ignore the
ping. Thus, there can be peers capable of receiving messages but not sending responses.
Peer Membership Protocol (PMP)
Peers use the Peer Membership Protocol for joining and leaving peer groups. This
protocol recognizes four discrete steps used by peers and thus defines JXTA messages for
each of these actions:
• Apply: A peer interested in entering a group can apply for a membership to the
group membership authenticator. The authenticator responds by sending back an
acknowledge message to the peer.
• Join: After an apply, the peer can choose to join the peer group.
• Renew: To update their membership information in the group, peers use the renew
message.
• Cancel: Peers can choose to cancel their peer group memberships.
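The four membership steps above can be viewed as a small state machine. The sketch below is only an illustration of that reading: the states, the MembershipSketch class, and the rule that a step is rejected when taken out of order are assumptions, and none of the names correspond to the JXTA API.

```java
/** Hypothetical sketch: the apply / join / renew / cancel steps as a state machine. */
final class MembershipSketch {
    enum State { OUTSIDE, APPLIED, MEMBER }

    private State state = State.OUTSIDE;
    private int renewals = 0;

    State state() {
        return state;
    }

    int renewals() {
        return renewals;
    }

    /** Apply: ask the group's membership authenticator for membership. */
    void apply() {
        require(State.OUTSIDE, "apply");
        state = State.APPLIED;
    }

    /** Join: only possible after a successful apply. */
    void join() {
        require(State.APPLIED, "join");
        state = State.MEMBER;
    }

    /** Renew: refresh membership information while being a member. */
    void renew() {
        require(State.MEMBER, "renew");
        renewals++;
    }

    /** Cancel: leave the group and return to the initial state. */
    void cancel() {
        require(State.MEMBER, "cancel");
        state = State.OUTSIDE;
        renewals = 0;
    }

    private void require(State expected, String action) {
        if (state != expected) {
            throw new IllegalStateException(action + " is not allowed in state " + state);
        }
    }

    public static void main(String[] args) {
        MembershipSketch membership = new MembershipSketch();
        membership.apply();
        membership.join();
        membership.renew();
        System.out.println(membership.state() + " after " + membership.renewals() + " renewal(s)");
        membership.cancel();
        System.out.println(membership.state());
    }
}
```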
The JXTA Java Binding
The best way to see the above protocols in action is to explore the JXTA Java
Binding, the JXTA reference implementation in Java. Developers can build on the existing
implementation or choose to implement their own version of the protocols in the languages
and platforms of their choice. Though the reference uses the HTTP and TCP/IP transports
because of their simplicity and popularity, you can implement the JXTA protocols on any
transport protocol, depending on the network topology.
The Class Organization
The JXTA Java Binding consists of two main class hierarchies:
• The net.jxta.* classes
• The net.jxta.impl.* classes
The first package contains all the JXTA interfaces, which are the Java interfaces
for the JXTA protocols and core building blocks. The second package contains these
interfaces' implementations. The interfaces and their implementations must be clearly
separated. Let's dive into these packages.
Where is My Peer?
A peer is an independent, asynchronous entity in the network associated with a
peer ID. You might consider an instance of running code as a peer. Currently, a boot class
(net.jxta.impl.peergroup.Boot), which provides a main() method, starts a
peer.
A peer's capabilities depend on the groups to which it belongs. But, by virtue of
just being a peer, every peer exhibits some minimum capability -- having an ID, for
instance. That means that there must exist at least one peer group that every peer must be a
member of: the world peer group. Also called the platform peer group, the world peer
group is represented by the class net.jxta.impl.peergroup.Platform, an
implementation of the PeerGroup class (net.jxta.peergroup.PeerGroup).
Peer Groups as Applications
An important abstraction in the binding, an application
(net.jxta.platform.Application) is anything that a peer group can initialize,
start, and stop. It is interesting to note that one peer group
(net.jxta.peergroup.PeerGroup) usually starts another peer group (refer to the
discussion on peer group nesting) and is hence an application. An exception is the platform
(or world) peer group. It is not started by any other peer group and forms the base of the
peer group hierarchy. An application defines three methods: init(), startApp(), and
stopApp(). The methods in the Application class are as follows:
public void init(PeerGroup group, Advertisement adv);
public int startApp(String[] args);
public void stopApp();
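To make the lifecycle concrete, the following sketch implements the three methods against local stand-in types. The interfaces PeerGroup, Advertisement, and Application declared here are hypothetical placeholders that only mirror the signatures quoted above; they are not the real net.jxta classes, and the convention that startApp() returns 0 on success is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the init / startApp / stopApp lifecycle. */
final class ApplicationLifecycleSketch {

    /** Stand-in for the peer group type; NOT the real JXTA interface. */
    interface PeerGroup {
        String name();
    }

    /** Stand-in for an advertisement; NOT the real JXTA interface. */
    interface Advertisement {
    }

    /** Mirrors the three method signatures listed in the text. */
    interface Application {
        void init(PeerGroup group, Advertisement adv);

        int startApp(String[] args);

        void stopApp();
    }

    /** Illustrative service that records each lifecycle call. */
    static final class QueryService implements Application {
        private PeerGroup group;
        private boolean running;
        final List<String> log = new ArrayList<>();

        @Override
        public void init(PeerGroup group, Advertisement adv) {
            this.group = group;
            log.add("init:" + group.name());
        }

        @Override
        public int startApp(String[] args) {
            if (group == null) {
                return 1; // assumed convention: non-zero signals that init() was skipped
            }
            running = true;
            log.add("start");
            return 0;
        }

        @Override
        public void stopApp() {
            running = false;
            log.add("stop");
        }

        boolean isRunning() {
            return running;
        }
    }

    public static void main(String[] args) {
        QueryService service = new QueryService();
        service.init(() -> "world", new Advertisement() { });
        service.startApp(new String[0]);
        service.stopApp();
        System.out.println(service.log);
    }
}
```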
3. SYSTEM DESIGN
3.1 System Requirements
A peer-to-peer database integration system was developed using Java and MySQL
with JXTA protocols and the Piazza algorithm. The different data sources for the project
reside on a standalone system.
Different data sources are used to collect the required information and display the
result to the end user in the required format. This project has a few data sources, namely
hospitals, clinical laboratories, and fire stations. There may be one or many hospitals
and one or many laboratories. Hospitals hold data on wards, the capacity of each
department, and general information that any user can view. Laboratories hold the
information on the patients’ blood samples. If the user of the system wants to see
information on all the hospitals and fire stations in a particular area with a particular
department, two queries to two different databases would normally be needed. But as this
project provides a global view of all the data from different sources, it collects the data
from the peer databases and returns the results with only one query.
The process does not require any schema changes for the peers, as the query is
executed at runtime, but the data sources should be consistent with respect to integrity of
data. Since the queries may range from simple to complex, there should be a particular
procedure that generates the queries that can pull the required data from the sources. The
Query Reformulation (Piazza) algorithm is used to create a rule-goal tree, which is used to
break down the global query and generate queries that can extract the required data [Suciu
2003].
3.2 Piazza Algorithm
The algorithm takes as input a conjunctive query Q(X) that is posed at some peer,
and a set of peer mappings and storage descriptions. The following are the steps involved
in the algorithm:
1. Each equality description is transformed into two inclusion mappings.
2. Each inclusion of the form Q1 ⊆ Q2 is then transformed to V ⊆ Q2, and
V :- Q1, where V is a new predicate name. (‘⊆’ says that Q1 is
contained in Q2).
3. Each node in the rule goal tree is labeled l(n) which is an atom whose
arguments are variables or constants. unc(n) is the father node of l(n).
4. The root of the tree is labeled Q(X) and its children are the subgoals of the
global query.
5. Choose an arbitrary leaf node and expand it following the steps specified below
until no leaf node can be expanded further:
i. Expand node n with the definition of its head if the head appears in
a definitional description. Create a child node l(nr) with every
subgoal of l(n). This type of expansion is applied if the peers
appear in GAV style.
ii. If the stored relation ‘p’ appears in the right-hand side of an
inclusion description or storage description r of the form V ⊆ Q1
(or V = Q1), we do the following.
a. Let n1, . . . , nm be the children of the father node of n, and p1, .
. . , pm be their corresponding labels.
b. The MCD (MiniCon description) contains an atom of the form
V and the set of atoms in p1, . . . , pm that it covers.
An MCD is a mapping from a subset of the variables in the query to variables in
one of the views. Intuitively, an MCD represents a fragment of a containment mapping
from the query to the rewriting of the query [Halevy 2001]. An MCD C for a query Q
over a view V is a tuple of the form $(h_C, V(Y)_C, \psi_C, G_C)$ where: $h_C$ is a head
homomorphism on V; $V(Y)_C$ is the result of applying $h_C$ to V, i.e., $Y = h_C(A)$, where A
are the head variables of V; $\psi_C$ is a partial mapping from Vars(Q) to $h_C(\mathrm{Vars}(V))$; and
$G_C$ is a subset of the subgoals in Q which are covered by some subgoal in $h_C(V)$ using the
mapping $\psi_C$.
c. A child rule node nr, labeled with r, is created for n, along with a child
goal node ng for nr labeled with V.
6. The solution is constructed from the rule-goal tree T. The result of the global query is
a union of conjunctive queries over the stored relations.
7. The body of the conjunctive query is the conjunction of all the leaves of T.
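Steps 1 and 2 of the list above are a purely syntactic normalization and can be sketched in a few lines. In this hedged example each description is a three-element array {left, operator, right}, the inclusion symbol is written "<=" to keep the source ASCII, and the fresh predicates are named V1, V2, and so on; the class name and this representation are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of steps 1 and 2: normalizing peer descriptions. */
final class MappingNormalizationSketch {

    /**
     * Each description is {left, op, right} with op "=" (equality) or "<=" (inclusion).
     * Returns, per inclusion Q1 <= Q2, the pair "V <= Q2" and "V :- Q1" for a fresh V.
     */
    static List<String> normalize(List<String[]> descriptions) {
        // Step 1: every equality becomes two inclusions.
        List<String[]> inclusions = new ArrayList<>();
        for (String[] d : descriptions) {
            String left = d[0];
            String op = d[1];
            String right = d[2];
            if (op.equals("=")) {
                inclusions.add(new String[] {left, right});
                inclusions.add(new String[] {right, left});
            } else if (op.equals("<=")) {
                inclusions.add(new String[] {left, right});
            } else {
                throw new IllegalArgumentException("unknown operator: " + op);
            }
        }
        // Step 2: split each inclusion through a new predicate name.
        List<String> normalized = new ArrayList<>();
        int fresh = 1;
        for (String[] inclusion : inclusions) {
            String v = "V" + fresh++;
            normalized.add(v + " <= " + inclusion[1]);
            normalized.add(v + " :- " + inclusion[0]);
        }
        return normalized;
    }

    public static void main(String[] args) {
        List<String[]> descriptions = new ArrayList<>();
        descriptions.add(new String[] {"Q1", "=", "Q2"});
        for (String line : normalize(descriptions)) {
            System.out.println(line);
        }
    }
}
```

After this step every mapping is either a definitional rule (V :- Q1), which is handled by unfolding, or an inclusion with a single atom on the left (V <= Q2), which is handled by the views-based rewriting.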
A user enters the query, Q, on the interface and the algorithm creates a bucket
for each subgoal in Q that collects what is relevant to answering that particular subgoal.
The subqueries collect data from different data sources and provide results in the format of
the global schema [Halevy 2003]. The flow chart of the above algorithm is shown in
Figure 3.2.1:
Figure 3.2.1 Flow chart of the query reformulation algorithm.
Consider a P2P system in which all peer mappings are definitional (similar to
GAV mappings in data integration). In this case, the algorithm is a simple construction of a
rule goal tree: goal nodes are labeled with atoms of the peer relations, and rule nodes are
labeled with peer mappings. It begins by expanding each query sub goal according to the
relevant definitional peer mappings in the PDMS. When none of the leaves of the tree can
be expanded any further, the storage descriptions are used for the final step of
reformulation in terms of the stored relations.
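For the purely definitional case just described, the rule-goal tree can be built by plain recursion. The sketch below is a simplified illustration: atoms are reduced to relation names, mappings are assumed to be acyclic, and any relation without a definition is treated as a stored relation. The class RuleGoalTreeSketch and the relation names used in main (borrowed from the example later in this section and treated here as if they were definitional) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: rule-goal tree for definitional (GAV-style) mappings. */
final class RuleGoalTreeSketch {

    /** Goal node: labeled with a relation name; its children are rule nodes. */
    static final class Goal {
        final String relation;
        final List<Rule> rules = new ArrayList<>();

        Goal(String relation) {
            this.relation = relation;
        }
    }

    /** Rule node: labeled with a mapping name; its children are goal nodes. */
    static final class Rule {
        final String mapping;
        final List<Goal> subgoals = new ArrayList<>();

        Rule(String mapping) {
            this.mapping = mapping;
        }
    }

    // head relation -> (mapping name -> body relations)
    private final Map<String, Map<String, List<String>>> definitions = new LinkedHashMap<>();

    void define(String mapping, String head, String... body) {
        definitions.computeIfAbsent(head, k -> new LinkedHashMap<>()).put(mapping, Arrays.asList(body));
    }

    /** Expand a goal until every leaf has no definition (assumes acyclic mappings). */
    Goal expand(String relation) {
        Goal goal = new Goal(relation);
        Map<String, List<String>> rules = definitions.get(relation);
        if (rules == null) {
            return goal; // leaf: treated as a stored relation
        }
        for (Map.Entry<String, List<String>> entry : rules.entrySet()) {
            Rule rule = new Rule(entry.getKey());
            for (String sub : entry.getValue()) {
                rule.subgoals.add(expand(sub));
            }
            goal.rules.add(rule);
        }
        return goal;
    }

    /** Read the tree back as a union of conjunctions over the leaves. */
    static List<List<String>> rewritings(Goal goal) {
        List<List<String>> result = new ArrayList<>();
        if (goal.rules.isEmpty()) {
            List<String> leaf = new ArrayList<>();
            leaf.add(goal.relation);
            result.add(leaf);
            return result;
        }
        for (Rule rule : goal.rules) {
            List<List<String>> partial = new ArrayList<>();
            partial.add(new ArrayList<>());
            for (Goal sub : rule.subgoals) {
                List<List<String>> subRewritings = rewritings(sub);
                List<List<String>> next = new ArrayList<>();
                for (List<String> left : partial) {
                    for (List<String> right : subRewritings) {
                        List<String> joined = new ArrayList<>(left);
                        joined.addAll(right);
                        next.add(joined);
                    }
                }
                partial = next;
            }
            result.addAll(partial); // alternative rules contribute to the union
        }
        return result;
    }

    public static void main(String[] args) {
        RuleGoalTreeSketch tree = new RuleGoalTreeSketch();
        tree.define("q", "Q", "sameareahospitals", "sameareafirestations");
        tree.define("r0", "sameareahospitals", "bfs", "cs");
        tree.define("r2", "sameareafirestations", "bs", "pfs");
        System.out.println(rewritings(tree.expand("Q")));
    }
}
```

The sketch covers only unfolding; the inclusion case, where one node may cover several sibling subgoals, needs the MCD machinery from the step list and is not modeled here.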
Suppose all peer mappings in the PDMS are inclusions in which the left-hand side
has a single atom (similar to LAV mappings in data integration). In this case, we begin
with the query sub goals and apply an algorithm for answering queries using views. The
algorithm is applied to the result until it cannot proceed further, and as in the previous case,
the storage descriptions are used for the last step of reformulation.
The first challenge of the complete algorithm is to combine and interleave the two
types of reformulation techniques. One type of reformulation replaces a sub goal with a set
of sub goals, while the other replaces a set of sub goals with a single sub goal. The
algorithm will achieve this by building a rule-goal tree, while it simultaneously marks
certain nodes as covering not only their parent node, but also their uncle nodes.
Before illustrating with an example, the schema of the data sources used is shown in Table 3.2.1.
Table 3.2.1 Database schema
For example, if we have to find the hospitals or departments and fire stations
in the zip area = 7010, the query that we put to the system is Q(h,f,e,a = 7010) where h is
the hospital id, f is the fire station id, e is the equipment and a is the area. The query
needs to get records from two different peers with two different schemas. Peer descriptions
come in handy here: as stated in the coordination rules, they match each peer
schema and let the communication happen. The storage descriptions aid in combining
results from different databases within the same peer.
The main query can be decomposed into two subgoals as follows:
(q) Q(h,f,e,a=7010) :- sameareahospitals(h,f,e,q,’7010’),
sameareafirestations(f,h,e,q,a=7010)
Peer descriptions:
(r0) sameareahospitals(h,a,d,b,p,n) proper subset of bfs(h,d,b,p,n,a=7010) and
cs(h,d,b,p,n,a=7010)
(r2) sameareafirestations(f,h,e,q,a) proper subset of bs(f,h,e,q,a=7010) and
pfs(f,h,e,q,a=7010)
Storage descriptions:
(r1) bfs(h,d,b,p,n,a=7010) is a subset of bfs(h,d,b,p,n,a)
(r1) cs(h,d,b,p,n,a=7010) is a subset of cs(h,d,b,p,n,a)
(r3) bs(f,h,e,q,a=7010) is a subset of bs(f,h,e,q,a)
(r3) pfs(f,h,e,q,a=7010) is a subset of pfs(f,h,e,q,a)
Reformulated Query:
Q’ :- bs(h,f,e,a=7010), pfs(f,h,e,a=7010), bfs(h,d,b,p,n,a=7010),
cs(h,d,b,p,n,a=7010)
Figure 3.2.2 Rule-goal tree for the query Q.
A query can be evaluated in a PDMS by sending it (reformulated appropriately) to
all the peers that might have answers. In such a scheme, it is absolutely vital that every
query not flood the entire network. The query reformulation algorithm devotes
considerable effort towards pruning rewritings that are guaranteed to return no results (or
redundant results). However, reformulation can only exploit information contained in the
schema mappings, whereas it would be desirable to exploit information about the actual
data stored at the peers in order to identify the peers relevant to the user query.
3.3 P2P Database and Coordination Rules
The database has been created in MySQL. Three different databases, one for
each peer, have been created, namely Clinical laboratories, Hospitals and
Fire stations, and each database corresponds to one node. The details from the three
databases should be fetched at the 911 center according to the need. The schemas of the three
databases are as shown in ‘Database Schema’ in section 3.2. Mapping of attributes is
possible with the declaration of coordination rules. A coordination rule has certain
specifications in its declaration.
Coordination Rule: A coordination rule allows a node i to fetch data from its
neighbor nodes j1, ..., jk [Franconi 2003].
Let I be a nonempty finite set of indices {1,2,3,..,n}, and C be a set of constants. For each
pair of distinct i, j ∈ I, let Li be a first order function-free language with signature disjoint
from Lj, but for the shared constants C. A local database DBi is a theory on the first order
language Li. A coordination rule is an expression of the form:
$$j_1 : b_1(x_1, y_1) \wedge \dots \wedge j_k : b_k(x_k, y_k) \Rightarrow i : h(x)$$
where $j_1, \dots, j_k, i$ are distinct indices, each $b_l(x_l, y_l)$ is a formula of $L_{j_l}$, $h(x)$ is a
formula of $L_i$, and $x = x_1 \cup \dots \cup x_k$.
The coordination rule for the node N0 according to the above schema is shown below:
Coordination rule snippet.
The above code corresponds to the following topology:
Figure 3.3.1 Topology.
In the above topology (Figure 3.3.1), we can query from node-0 and retrieve the results
from node-1 to node-3.
3.4 Workflow
After setting up the Java environment, the batch file in each node is run, which
opens a window known as the status window, as shown in Figure 3.4.1, and an interface, as
shown in Figure 3.4.2, indicating that the peer's engine has started, registers in the
topology and discovers any other peers around the sphere.
Figure 3.4.1 Status window.
Figure 3.4.2 Interface for each node (batch file).
Once all the peers have started, click on “Read Coordination rules from file” on
the node 0 interface, which reads the XML file of rules that defines the relationships among
peers. Figure 3.4.3 shows the status of the peers during this step:
Figure 3.4.3 Status during coordination rules announcement.
Now click on the “Publish Topology Advertisement” button, which creates a one-way
channel from node 0 to all the other nodes, followed by the “Initialize connections in the
network” button, which enables all the peers to participate in the network. Figure 3.4.4 shows
the status of the peers during this step:
Figure 3.4.4 Peers are ready.
Now switch to the Queries tab and type in the required query to get the results
from all the participating peers.
Example: All the hospitals and department name with number of beds ‘700’
Q(h,a,d,b):-h1(h,a,d,b,p,n);b=’700’; The corresponding screen is as shown in Figure 3.4.5.
Figure 3.4.5 Execution of Query 1.
Example: Find all the fire stations and hospitals in the city of Corpus Christi
Q(h,f,e,a) :- fs1(f,h,e,q,a),h1(h,a,d,b,p,n). The corresponding screen is as shown in
Figure 3.4.6.
Figure 3.4.6 Execution of Query 2.
4. EVALUATION AND RESULTS
The system has been tested in two phases: during the development phase and after
the development phase. Since the system is basically built through integration, testing
each unit and the system as a whole was the main aspect of evaluation. Debugging has
also been given due importance, as it can expose integration errors. The coordination
rules were checked more than once to make sure that all the peers participate actively
in the sphere.
4.1 Evaluation
Testing the system during construction made it easy to find possible bugs
and eliminate them. Since the system is built using Java and JXTA, all the core classes
were tested using JUnit test cases. Each peer was tested separately by posing queries and
validating the results obtained.
The hard part started when integrating the system with different peers. The
communication channel between two peers is established using pipes (one of the classes
in JXTA). The following problems occurred with the configuration of pipes:
a. The channel is sometimes established one way only, allowing communication in only
one direction even though it is properly configured for two-way communication. This
problem has been reported to the JXTA forum and it has been taken care of.
b. Intermittent disconnections occur as two peers communicate because of a lack of
persistence in peer address storage. This problem has been addressed by increasing the
size of the address table.
Redundant or duplicate tuples have been reported upon execution of the query
involving more than 3 peers or if the execution of query generates a rule goal tree of level
4 or more. This problem can be attributed to the structure of JXTA communication
architecture which needs to be corrected by JXTA developers in the future.
The time of execution of any query falls in the range 0.01 to 1.1 sec. Queries
involving one peer to a collection of peers is supported using piazza system.
The Piazza algorithm is theoretically well structured to add any number of peers to
the system and to answer any kind of query. In practice, however, creating such a complex
topology or rule-goal tree is hardly possible given the condition of JXTA in its present
form.
Another problem, encountered when executing a query that involves more than 3
peers with a rule-goal tree of level 1 or 2, is that results are sometimes either repeated
or discarded.
4.2 Results
Different queries were posed at node 0, and the results obtained from all the peers
were compared with the corresponding results obtained by querying each database directly
and compiling the result set.
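The comparison described above can be sketched as a multiset check, so that the repeated or discarded tuples noted in Section 4.1 would show up as a mismatch. This is only an illustrative sketch: the class name `ResultSetComparisonSketch`, its methods, and the sample tuples are assumptions made for the example and are not taken from the developed system.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of the described system) for comparing two result sets. */
final class ResultSetComparisonSketch {

    /** Counts how often each tuple occurs, so that duplicates are not hidden. */
    static Map<List<String>, Integer> countTuples(List<List<String>> tuples) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> tuple : tuples) {
            counts.merge(tuple, 1, Integer::sum);
        }
        return counts;
    }

    /** True when both results hold the same tuples the same number of times, in any order. */
    static boolean sameMultiset(List<List<String>> expected, List<List<String>> actual) {
        return countTuples(expected).equals(countTuples(actual));
    }

    public static void main(String[] args) {
        // Made-up tuples standing in for the result compiled by querying each database directly.
        List<List<String>> direct = Arrays.asList(
                Arrays.asList("1", "a"),
                Arrays.asList("2", "b"));
        // Made-up tuples standing in for the result returned at node 0; one tuple is repeated.
        List<List<String>> fromPeers = Arrays.asList(
                Arrays.asList("2", "b"),
                Arrays.asList("1", "a"),
                Arrays.asList("1", "a"));
        System.out.println("results match: " + sameMultiset(direct, fromPeers));
    }
}
```

Comparing counts rather than plain sets is a deliberate choice here: a set comparison would accept a result that contains every expected tuple twice.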
Table 4.2.1 Results of different query executions
The query should be written in the following recommended syntax:
Q(fields from the n0 schema) :- schema name (fields from corresponding schema),[schema name (fields from corresponding schema)];condition
*condition: field name = value
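To make the recommended syntax concrete, the following sketch splits a query string into its head, its body atoms, and its condition. It is a minimal illustration and not the parser used by the system: the class `QuerySyntaxSketch` and its members are hypothetical, the condition is assumed to be optional, and the condition value `5` in the sample is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the described system): splits a query string into its parts. */
final class QuerySyntaxSketch {

    /** One atom such as fs1(f,h,e,q,a): a schema name plus its field list. */
    static final class Atom {
        final String name;
        final List<String> fields;
        Atom(String name, List<String> fields) { this.name = name; this.fields = fields; }
        @Override public String toString() { return name + fields; }
    }

    /** Head atom, body atoms, and the condition (both condition parts are null when absent). */
    static final class ParsedQuery {
        final Atom head;
        final List<Atom> body;
        final String conditionField;
        final String conditionValue;
        ParsedQuery(Atom head, List<Atom> body, String field, String value) {
            this.head = head; this.body = body;
            this.conditionField = field; this.conditionValue = value;
        }
    }

    // Matches "name(field, field, ...)" with optional surrounding whitespace.
    private static final Pattern ATOM = Pattern.compile("\\s*(\\w+)\\s*\\(([^()]*)\\)\\s*");

    static Atom parseAtom(String text) {
        Matcher m = ATOM.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed atom: " + text);
        }
        List<String> fields = new ArrayList<>();
        for (String field : m.group(2).split(",")) {
            fields.add(field.trim());
        }
        return new Atom(m.group(1), fields);
    }

    static ParsedQuery parse(String query) {
        String[] headAndRest = query.split(":-", 2);
        if (headAndRest.length != 2) {
            throw new IllegalArgumentException("Missing ':-' in query: " + query);
        }
        Atom head = parseAtom(headAndRest[0]);
        // Assumption: the ";condition" part may be omitted.
        String[] bodyAndCondition = headAndRest[1].split(";", 2);
        List<Atom> body = new ArrayList<>();
        // Split only on commas that directly follow a closing parenthesis.
        for (String atomText : bodyAndCondition[0].split("(?<=\\))\\s*,")) {
            body.add(parseAtom(atomText));
        }
        String field = null;
        String value = null;
        if (bodyAndCondition.length == 2) {
            String[] fieldAndValue = bodyAndCondition[1].split("=", 2);
            if (fieldAndValue.length != 2) {
                throw new IllegalArgumentException("Condition must be 'field name = value'");
            }
            field = fieldAndValue[0].trim();
            value = fieldAndValue[1].trim();
        }
        return new ParsedQuery(head, body, field, value);
    }

    public static void main(String[] args) {
        ParsedQuery q = parse("Q(h,f,e,a) :- fs1(f,h,e,q,a),h1(h,a,d,b,p,n);h = 5");
        System.out.println("head: " + q.head);
        System.out.println("body: " + q.body);
        System.out.println("condition: " + q.conditionField + " = " + q.conditionValue);
    }
}
```

The sample string reuses the head and body of Query 2 from Section 3.4 and appends an invented condition only to exercise the `;condition` part of the syntax.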
5. FUTURE WORK
Future research includes reconciling peers with inconsistent integrity constraints
and considering richer constraint languages at the peers. More generally, peer data
management is a very rich domain that creates a wealth of new problems, such as how to
replicate data, how to reconcile inconsistent data, and how to optimize across multiple
peers.
Although the prototype application is still somewhat preliminary, it already
suggests that the architecture provides useful and effective mediation for heterogeneous
structured data, and that adding new sources is easier than in a traditional two-tier
environment. Furthermore, the overall Piazza system gives a strong research platform for
uncovering and exploring issues in building a semantic web.
A key aspect of the system is that there may be many alternate mapping paths
between any two nodes. An important problem is identifying how to prioritize the paths
that preserve the most information, while avoiding paths that are too diluted to be useful.
A related problem at the systems level is determining an optimal strategy for evaluating
the rewritten query.
6. CONCLUSION
The concept of peer data management offers not only an ad hoc, scalable,
distributed peer-to-peer computing environment (which is compelling from a distributed
systems perspective), but also an easily extensible, decentralized environment for sharing
data with rich semantics.
The primary contribution of the query reformulation algorithm is that it combines
both LAV- and GAV-style reformulation in a uniform fashion, and it is able to chain
through multiple peer descriptions to reformulate a query.
This project makes it easier to extract results from several databases at once
without changing the schema design of any individual database.
7. BIBLIOGRAPHY
[Anderson 1995] T. E. Anderson, M. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli,
and R. Wang. Serverless network file systems. In SOSP 1995, volume 29(5), pages 109–
126, December 1995.
[Berners 2001] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific
American, May 2001.
[Bolosky 2000] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a
serverless distributed file system deployed on an existing set of desktop pcs. In Proc.
Measurement and Modeling of Computer Systems, 2000, pages 34–43, June 2000.
[Cao 1998] P. Cao, J. Zhang, and K. Beach. Active cache: Caching dynamic contents on
the web. In Middleware ’98, Sept. 1998.
[Chen 1976] Chen, P.P., “The entity-relationship model: towards a unified view of data,”
ACM Transactions on Database Systems, vol. 1, no. 1, pp.9-36, 1976.
[David 1983] D. Maier, “Null Values Partial Information and Database Semantics,” pp.
371-438 in The Theory of Relational Databases (1983).
[Doan 2002] A. Doan and A. Halevy. Efficiently ordering query plans for data integration.
In Proc. of ICDE, 2002.
[Fan 1998] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: A scalable wide-
area web cache sharing protocol. In Proc. of ACM SIGCOMM ’98, August 1998.
[Gray 1996] J. Gray, P. Helland, P. E. O’Neil, and D. Shasha. The dangers of replication
and a solution. In SIGMOD ’96, pages 173–182, 1996.
[Franconi 2003] E. Franconi, G. Kuper, A. Lopatenko, and L. Serafini. A Robust Logical
and Computational Characterisation of Peer-to-Peer Database Systems, in International
Workshop On Databases, Information Systems and Peer-to-Peer Computing, 2003.
(Slides)
[Halevy 2003] A. Halevy, Z. Ives, P. Mork, and I. Tatarinov. Piazza: Data Management
Infrastructure for Semantic Web Applications. In WWW 2003.
[Ives 2000] Z. G. Ives, A. Y. Levy, J. Madhavan, R. Pottinger, S. Saroiu, I. Tatarinov, S.
Betzler, Q. Chen, E. Jaslikowska, J. Su, and W. T. T. Yeung. Self-organizing data sharing
communities with SAGRES. In SIGMOD ’00, page 582, 2000.
[Kubiatowicz 2000] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,
R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, About LEGION – the Grid OS.
World-wide web: www.appliedmeta.com/legion/about.html, 2000.
[Leonid 1998] Leonid Stoimenov, Aleksander Stanimirovic, Slobodanka Djordjevic-Kajan,
“Discovering mappings between ontologies in semantic integration process”
[Rodriguez 2003] Rodriguez, M.A, Egenhofer M., Determining Semantic Similarity
Among Entity Classes from Different Ontologies, IEEE Transaction on Knowledge and
Data Engineering, 2003.
[Napster 2001] Napster. World-wide web: www.napster.com, 2001.
[Rabinovich 1998] M. Rabinovich, J. Chase, and S. Gadde. Not all hits are created equal:
Cooperative proxy caching over a wide area network. In Proc. of the 3rd Int. WWW
Caching Workshop, June 1998.
[Siong 2003] W. Siong Ng, B. Chin Ooi, K. L. Tan, and A. Ying Zhou. Peerdb: A p2p-
based system for distributed data sharing. In International Conference On Data
Engineering (ICDE), 2003.
[Suciu 2003] A. Halevy and Z. Ives and D. Suciu and I. Tatarinov. Schema Mediation in
Peer Data Management Systems. In ICDE 2003.
[Tanenbaum 1996] A. S. Tanenbaum. Computer Networks. Prentice Hall PTR, 3rd edition,
1996.
[Wikipedia 2007]
http://www.conceptdraw.com/products/img/ScreenShots/cd5/software/Chen_ERD.gif
8. APPENDIX
1. GAV - In the Global-As-View (GAV) approach, one defines the concepts in the
global schema as views over the data sources.
2. LAV – In the Local-As-View (LAV) approach, one characterizes the data sources as
views over the global schema.
3. Materialized Views - A materialized view takes a different approach in which the
query result is cached as a concrete table that may be updated from the original
base tables from time to time. This enables much more efficient access, at the cost
of some data being potentially out-of-date.
4. Mediated Schema – A mediated schema allows a user to access multiple databases
through mappings between the source schemas and the mediated schema.
5. Mobile Agent - A Mobile Agent is a composition of computer software and data
which is able to migrate (move) from one computer to another autonomously and
continue its execution on the destination computer.
6. Node - A node is a critical element of any computer network. It can be defined as
a point in a network at which lines intersect or branch, a device attached to a
network, or a terminal or other point in a computer network where messages can
be transmitted, received or forwarded.
7. Ontology – It is a branch of metaphysics, often considered the most fundamental.
It is the study of the nature of being, existence, or reality in general and of its
basic categories and their relations, with particular emphasis on determining what
entities exist or can be said to exist, and how these can be grouped and related
within an ontology (typically, a hierarchy subdivided according to similarities and
differences).
8. P2P - Peer to peer (P2P) is a network architecture in which participating
computers (peers) connect directly with each other to search for and exchange
content, rather than going through a central server. Because the load is spread
across the peers, a P2P network is very efficient for downloading large files.
9. Schema - The schema of a database system is its structure described in a formal
language supported by the database management system (DBMS). In a relational
database, the schema defines the tables, the fields in each table, and the
relationships between fields and tables.
10. Topology - It is the study of the arrangement or mapping of the elements (links,
nodes, etc.) of a network, especially the physical (real) and logical (virtual)
interconnections between nodes.
11. View - A view is a stored query accessible as a virtual table composed of the
result set of a query. Unlike ordinary tables (base tables) in a relational database, a
view is not part of the physical schema: it is a dynamic, virtual table computed or
collated from data in the database. Changing the data in a table alters the data
shown in the view.