
Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Transcript
  • Distributed Query Processing and Catalogs for Peer-to-Peer Systems
    Professor: Iluju Kiringa
    Students: Fan Yang, Libin Cai

  • Agenda
    - About P2P
    - Mutant Query Plans
    - Distributed Catalogs
    - Intentional Statements
    - Security and Privacy
    - Conclusions

  • About P2P
    Advantages:
    - Ease of deployment
    - Ease of use
    - Fault tolerance
    - Scalability
    Limitations:
    - Weak query capabilities
    - No infrastructure for distributed queries
    - Limitations in index scalability and result quality

  • A query example
    User Bob wants to see a movie tonight. Bob visits his favorite
    portal, BobsPortal.com, and uses its GUI front-end to come up with an
    XML query:

      FOR $r in document("film_reviews")//review,
          $g in document("preferences")//genre,
          $s in document("film_showings")/showing[date = "15 March 2002"]
      WHERE $r/genre = $g AND $r/title = $s/title
      RETURN { $r/title } { $r/rating } { $s/theater }

    Three XML documents: film reviews, preferences, and film showings. [2]

  • A query example (cont)
    The logical query plan has three kinds of elements:
    - Regular query operators: select, join
    - Pseudo-operators: document, display
    - References to XML fragments
    (Figure: query processing turns the logical query plan into a
    physical query plan, which the query processor then executes. [2])

  • Advent of the Mutant Query Plan
    Why MQP? It:
    - can cope with incomplete metadata
    - can decentralize query optimization and execution
    - respects the autonomy and local policies of sites
    - adapts to server and network conditions even while being evaluated
    What is an MQP? An algebraic query plan graph, encoded in XML,
    containing:
    - References to resource locations (URLs)
    - References to abstract resource names (URNs)
    - Verbatim XML fragments
    Each MQP is tagged with a target, the destination to which it is
    sent once fully evaluated.
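    To make the encoding concrete, here is a minimal Python sketch of how
    such a plan might be written as XML. The element names, the predicate
    syntax, and the target attribute are illustrative assumptions, not
    the prototype's actual MQP schema:

      # Minimal sketch of a mutant query plan encoded as XML.
      # Element and attribute names are illustrative assumptions.
      import xml.etree.ElementTree as ET

      def garage_sale_mqp() -> ET.Element:
          """Build a toy MQP: select CDs <= $10 over an unresolved URN."""
          plan = ET.Element("mqp", target="client.example.org:9000")
          display = ET.SubElement(plan, "display")
          select = ET.SubElement(display, "select", predicate="price <= 10")
          # An abstract resource name, to be resolved to URLs by some peer.
          ET.SubElement(select, "resource", urn="urn:ForSale:Portland-CDs")
          return plan

      print(ET.tostring(garage_sale_mqp(), encoding="unicode"))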

  • Mutant Query Processing (figure) [1]

  • Mutant Query Plan Example
    Garage sale example. Query: CDs for $10 or less in the Portland area.
    The MQP contains:
    - Regular query operators: select, join
    - Pseudo-operator: display
    - A constant piece of XML
    - URNs [1]

  • Mutant Query Plan Example (cont)
    (a) Resolution and rewriting, (b) reduction [1]

  • Comparison between the pipelined plan and the mutant plan
    (a) Pipelined plan, (b) mutant plan [2]

  • Distributed Catalogs
    Question: how do peers find out about resources available at other
    peers? Answer: build distributed catalogs to efficiently route
    queries. Procedure:
    - Peers use multi-hierarchic namespaces to categorize data;
    - Data providers use them to describe the data they serve;
    - Data consumers use them to formulate queries.
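    As a rough sketch of what a peer's local catalog could look like
    (the structure and names are assumptions for illustration; the real
    catalog maps each URN to URLs or to servers that can resolve it):

      from dataclasses import dataclass, field

      @dataclass
      class CatalogEntry:
          urls: list[str] = field(default_factory=list)       # resolved locations
          referrals: list[str] = field(default_factory=list)  # peers to ask next

      catalog: dict[str, CatalogEntry] = {
          "urn:ForSale:Portland-CDs": CatalogEntry(
              urls=["http://10.1.2.3:9020/", "http://10.2.3.4:9020/"]),
          "urn:ForSale:Portland-Chairs": CatalogEntry(
              referrals=["index.portland.example:9020"]),
      }

      def resolve_locally(urn: str) -> CatalogEntry | None:
          """Return what this peer knows about a URN, if anything."""
          return catalog.get(urn)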

  • Multi-hierarchic Namespaces
    A multi-hierarchic namespace is the set of categorization hierarchies
    relevant to an application's domain. [1]

    Interest area example: second-hand armchairs in the Portland area is
    [USA/OR/Portland, Furniture/Chairs].
    (Figure: a multi-hierarchic namespace with two categorization
    dimensions and two highlighted interest areas: (a) Vancouver-Portland
    furniture, (b) items in Portland. [1])
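    A small Python sketch of how interest-area overlap might be tested,
    assuming an interest area is one path per hierarchy and that two
    paths overlap when one is an ancestor of the other (with * matching
    anything). This is an illustration, not the paper's algorithm:

      def _paths_overlap(a: str, b: str) -> bool:
          if a == "*" or b == "*":
              return True
          pa, pb = a.split("/"), b.split("/")
          n = min(len(pa), len(pb))
          return pa[:n] == pb[:n]  # one path is an ancestor of the other

      def areas_overlap(area1: list[str], area2: list[str]) -> bool:
          return all(_paths_overlap(a, b) for a, b in zip(area1, area2))

      # Vancouver-Portland furniture vs. second-hand chairs in Portland:
      print(areas_overlap(["USA/OR", "Furniture"],
                          ["USA/OR/Portland", "Furniture/Chairs"]))  # True
      print(areas_overlap(["USA/WA", "Furniture"],
                          ["USA/OR/Portland", "Furniture/Chairs"]))  # False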

  • Peer Roles

  • Resource Resolution
    Authoritative server: strives to know about all base servers within
    its interest area. Through an authoritative index or meta-index
    server, the known base servers in a particular interest area can be
    found.
    Resource resolution:
    - Seek an authoritative index or meta-index server
    - Recursively follow the index references
    - Find all the relevant base servers and data items
    - Resolve the URN

  • Example of Resource Resolution
    URN: ForSale:Portland-CDs
    URLs: http://10.1.2.3:9020/, http://10.2.3.4:9020/
    Interest area: [USA/OR/Portland, Music/CDs]
    - Authoritative meta-index server A: [USA, *]
    - Index server B: [USA, Music]
    - Index server C: [USA/OR, Music]
    - Index server G: replaces the URN with URLs
    Routing: query plan -> A -> B -> C -> G ->
    http://10.1.2.3:9020/, http://10.2.3.4:9020/
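    The resolution walk can be pictured with a toy Python sketch; the
    routing table below hard-codes the A -> B -> C -> G chain from this
    example and is purely illustrative:

      # Each server either returns URLs for the URN or refers the request
      # to a more specific index server.
      ROUTES = {
          "A": ("refer", "B"),   # authoritative meta-index for [USA, *]
          "B": ("refer", "C"),   # index server for [USA, Music]
          "C": ("refer", "G"),   # index server for [USA/OR, Music]
          "G": ("urls", ["http://10.1.2.3:9020/", "http://10.2.3.4:9020/"]),
      }

      def resolve(urn: str, server: str) -> list[str]:
          # urn is unused in this toy: every hop routes the same request.
          kind, value = ROUTES[server]
          if kind == "urls":
              return value               # URN replaced with URLs
          return resolve(urn, value)     # follow the index reference

      print(resolve("ForSale:Portland-CDs", "A"))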

  • Intentional Statements
    Purposes:
    - How can index and meta-index servers convey the relationships
      between the data they cover?
    - How can mutant queries use this information to make intelligent
      choices about completeness, currency and latency tradeoffs?
    Intentional statements describe relationships between index and
    meta-index servers, and can be expressed using coordination formulas:
    - Server R replicates everything from server S for the Portland
      category of the Location hierarchy:
        base[Portland, *]@R = base[Portland, *]@S
    - The only Oregon sporting-goods information that R holds is S's
      data for Portland and Eugene golf clubs:
        base[Oregon, Sporting Goods]@R =
          base[Portland, Golf Clubs]@S ∪ base[Eugene, Golf Clubs]@S
    - R indexes several base servers:
        index[Oregon, Golf Clubs]@R =
          base[Oregon, Golf Clubs]@S ∪ base[Oregon, Golf Clubs]@T ∪
          base[Oregon, Golf Clubs]@U

  • Utilizing Intentional Statements (cont)
    Process:
    - Whenever a server registers an interest area with a meta-index
      server, it also provides intentional statements.
    - Servers can then use this information in binding and routing MQPs.

    Assumptions: meta-index server M knows about servers R and S, with
    interest areas R: [Portland, Recreation] and S: [Oregon, Sporting
    Goods]. M receives an MQP that contains the resource name
    [Portland, Golf Clubs]. The name could be bound to:
      base[Portland, Golf Clubs]@R ∪ base[Portland, Golf Clubs]@S
    If M knows the intentional statement
      base[Portland, Sporting Goods]@R = base[Portland, Sporting Goods]@S
    then it can instead bind to:
      base[Portland, Golf Clubs]@R | base[Portland, Golf Clubs]@S
    Conclusion: the MQP can be routed to either R or S, but it need not
    go to both.
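    A toy Python sketch of how a meta-index server might exploit such a
    statement when binding. The data structures and the sub-category
    table are illustrative assumptions; the real system reasons over
    coordination formulas and the category hierarchies:

      # Areas known to hold identical data at several servers.
      equivalent_servers = {("Portland", "Sporting Goods"): {"R", "S"}}

      def covers(general, specific):
          # Illustrative sub-category table: Golf Clubs is a kind of
          # Sporting Goods.
          sub = {"Golf Clubs": "Sporting Goods"}
          return (general[0] == specific[0] and
                  general[1] in (specific[1], sub.get(specific[1])))

      def candidate_servers(area, known):
          """All servers whose known holdings cover `area`."""
          matches = [s for a, s in known.items() if covers(a, area)]
          return set().union(*matches) if matches else set()

      servers = candidate_servers(("Portland", "Golf Clubs"),
                                  equivalent_servers)
      print("route to any one of", servers)   # {'R', 'S'}: one suffices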

  • Utilizing Intentional Statements (cont)
    For queries that need not run instantly, suppose server R replicates
    everything for Portland at S (and possibly keeps additional data
    about Portland), polling S every 30 minutes to update the data it
    replicates, so R can be up to 30 minutes out of date. Intentional
    statement:
      base[Portland, *]@R ⊇ base[Portland, *]@S{30}
    A binding for resource [Portland, CDs] might then be:
      base[Portland, CDs]@R{30} |
      (base[Portland, CDs]@R ∪ base[Portland, CDs]@S){0}
    Explanation:
    - One can get an answer quickly by routing the MQP only to R, but
      that answer could be up to 30 minutes out of date.
    - By routing the MQP to both R and S, one gets a complete and
      current answer.
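    A small sketch of the resulting choice, treating each binding as a
    (staleness bound, hops) pair and picking the cheapest acceptable one;
    the numbers mirror this example and the cost model is an assumption:

      def pick_binding(max_staleness_min: float):
          """Return the cheapest binding whose currency bound is OK."""
          bindings = [
              {"servers": ["R"],      "staleness": 30, "hops": 1},
              {"servers": ["R", "S"], "staleness": 0,  "hops": 2},
          ]
          ok = [b for b in bindings if b["staleness"] <= max_staleness_min]
          return min(ok, key=lambda b: b["hops"])

      print(pick_binding(60))  # R alone: slightly stale answer, one hop
      print(pick_binding(0))   # R and S: complete and current, two hops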

    Conclusions:
    - It is impossible to guarantee that queries run instantly; one must
      compromise among latency, completeness and currency.
    - Replication can't be both scalable and instantaneous.

  • What else could be in MQPs
    - Accumulating catalog and statistics information
    - Maintaining provenance
    - Rewards systems
    - Meta-index updating
    - Detection of spoofing

  • Security and Privacy
    Issue: with MQPs, partial results may be divulged to undesirable
    servers.
    Solutions:
    - MQPs need to incorporate ordering and transfer policies
    - Encrypt data or data elements with the destination's public key
    - MQPs can allow answers to be obtained while respecting each
      server's security policies

  • Conclusions
    MQPs enable peers to independently optimize and partially evaluate
    queries without global knowledge, and with a minimum of coordination
    overhead.

  • References

    [1] Vassilis Papadimos, David Maier, and Kristin Tufte. Distributed
    Query Processing and Catalogs for Peer-to-Peer Systems. OGI School
    of Science & Engineering, Oregon Health & Science University.
    [2] V. Papadimos and D. Maier. Distributed Queries without
    Distributed State. In Proc. of WebDB 2002, pages 95-100.

  • Thanks!

    Questions?...

    We know that P2P means peer-to-peer. In a P2P distributed system,
    nodes, each of which can both consume and provide data and/or
    services, may join and leave the network at any time, resulting in a
    truly dynamic and ad hoc environment. Here are some advantages. I
    won't go through the advantages in detail, since I believe another
    group will give a wonderful presentation about them. Instead, I
    would like to point out some limitations.

    Limitations: the Internet is arguably the most successful
    distributed computing system ever. However, the queries we can ask
    of remote servers are limited:
    - Weak query capabilities: the schema and queries for searching
      content are typically hardwired into the application (the
      application does the search); one can get the data for a given
      URL, or ask predefined queries using some form interface; and
      queries only involve a single client pulling data from a single
      server.
    - No infrastructure for distributed queries: many queries that we
      routinely want to ask require combining data from different data
      sources, and we cannot always move the data pertinent to a query
      to a single server and do the processing there.

    Ease of deployment: each user installs a single package that
    encompasses both client and server code; its initial configuration
    depends only on knowing a fixed index server or a single other
    installation; servers need not be continuously active.
    Ease of use: the server code is bundled with a user interface
    application to publish, search and retrieve content.
    Fault tolerance: failure or unavailability of a single server (other
    than a central index) does not disable the system. It might render
    some content unavailable, but much of the content ends up being
    heavily replicated.
    Scalability: as the number of users and amount of content increase,
    so does the number of servers; protocols do not require all-to-all
    communication or coordination.

    Now, let's look at a query example. This query works on three XML
    documents: film reviews, preferences, and film showings. There are
    several kinds of magic going on here: BobsPortal.com is smart enough
    to know that "preferences" means Bob's preferences, and "film
    showings" only includes theaters in Bob's town. The query also
    treats these documents as abstract resources; it does not mention
    their actual locations anywhere.

    The query processor will start by translating Bob's query into a
    logical query plan (see figure), which is a directed graph of
    logical query operators, such as select or join, that consume and
    produce sequences of tuples. A tuple contains references to XML
    fragments. We also have special pseudo-operators, such as document,
    which creates a sequence of tuples by fetching data from a URL, and
    display, which presents results on the client's computer.
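    One plausible shape for that operator graph, sketched with Python
    dataclasses (the class names and the exact tree structure are
    illustrative assumptions):

      from dataclasses import dataclass

      @dataclass
      class Document:          # pseudo-operator: fetch tuples from a source
          name: str

      @dataclass
      class Select:            # regular operator: filter tuples
          predicate: str
          child: object

      @dataclass
      class Join:              # regular operator: combine two inputs
          condition: str
          left: object
          right: object

      @dataclass
      class Display:           # pseudo-operator: deliver results to client
          child: object

      plan = Display(
          Join("r.title = s.title",
               Join("r.genre = g.genre",
                    Document("film_reviews"),
                    Document("preferences")),
               Select("date = '15 March 2002'",
                      Document("film_showings"))))
      print(plan)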

    We turn a logical query plan into a physical query plan by selecting
    an implementation algorithm for each query operator, such as
    nested-loops join or hash join for a logical join. A physical query
    plan can be executed directly by the query processor.

    BobsPortal.com, however, may not have the film reviews and film
    showings data locally; so how should it process the query? If it
    knows how to resolve these abstract documents into actual URLs, it
    could download them and process the query. But then it may need to
    transfer large amounts of data (every movie showing in Bob's town,
    and reviews for every movie currently playing anywhere), even though
    we only need a subset of these data.

    The problems here are: what if it doesn't know how to resolve the
    URN? We also want to do some processing near the data sources to
    reduce data transfer.

    In the traditional distributed query processing model, one site
    (called the coordinator) optimizes a client's query into a
    distributed query plan: each operator in the query plan carries an
    annotation indicating where it should run. The coordinator then
    sends the subplans to the apprentice sites. When each subplan is
    executed on its site, it returns the results to the coordinator. For
    query optimization to work well, the coordinator needs detailed
    statistics such as data placement, which sites participate, and how
    they process. The problems are: it's hard to scale to large networks
    of autonomous sites; a remote site may refuse to run sub-queries (it
    may be down, offline, or overloaded); we cannot hope to maintain
    accurate and timely statistics on data location; and it's hard to
    maintain the characteristics of all possible Internet resources at a
    single centralized location.

    Pseudo-operators: constant encapsulates an XML fragment, resource
    represents a URN, and display specifies the final destination. Each
    server has a metadata catalog that maps each URN to a URL, or to
    servers that can resolve it.

    We introduce a framework using mutant query plans to decentralize
    query optimization and execution. Mutant query plans can cope with
    these problems: they tolerate incomplete metadata, decentralize
    query optimization, respect site autonomy, and adapt to server and
    network conditions even while being evaluated. To evaluate a mutant
    plan, we resolve its URNs to their corresponding URLs. This is a
    very important feature, since our purpose is to fully evaluate an
    MQP, meaning the plan comes to contain only URLs or XML fragments.
    The tag of an MQP names the destination network address.

    Now let's see how to transform the original query into an evaluated
    query plan. A server parses an MQP into a tree of query operators
    and constant data. Every server maintains a local catalog that maps
    each URN to either a URL, or to a set of servers that know more
    about the URN. The server resolves the URNs it knows about; then its
    optimizer component (re)optimizes the plan and finds or creates
    sub-plans that can be evaluated locally, with their associated
    costs. The policy manager, at this point, decides whether to accept
    or reject the mutant plan (maybe the server is overloaded, or the
    plan's cost is too high). The policy manager also decides how much
    of the plan to evaluate locally, and passes those sub-plans to the
    query engine. The server then substitutes each evaluated sub-plan
    with its results (as an XML fragment), to get a new, mutated query
    plan.

    If the plan is not yet fully evaluated, we must decide which server
    to send it to next. Again consulting the catalog, we send the plan
    to a server that knows how to resolve at least one of the remaining
    resources. A given server does not need to know how to resolve every
    URN in a plan; as long as the plan eventually passes through a
    server that does, it can be evaluated. At some point, a server will
    hopefully reduce the plan to an XML document and forward it to its
    final destination (which may be different from its origin), or
    alternatively report its failure to process the plan further.
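    The whole mutate-and-forward cycle can be caricatured in a few lines
    of Python; here a "plan" is just a set of unresolved URNs plus
    accumulated fragments, and the server data and routing table are
    invented for illustration:

      SERVERS = {
          "B": {"urn:plays:comedies":  "<lines>Falstaff, comedies</lines>"},
          "C": {"urn:plays:tragedies": "<lines>Falstaff, tragedies</lines>"},
      }
      ROUTE = {"B": "C"}   # who to try next if URNs remain

      def process(plan, server):
          data = SERVERS[server]
          for urn in list(plan["unresolved"]):
              if urn in data:                      # resolve + evaluate locally
                  plan["fragments"].append(data[urn])
                  plan["unresolved"].remove(urn)   # the plan mutates
          if plan["unresolved"]:
              return process(plan, ROUTE[server])  # forward to next server
          return plan                              # fully evaluated

      plan = {"unresolved": {"urn:plays:comedies", "urn:plays:tragedies"},
              "fragments": []}
      print(process(plan, "B")["fragments"])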

    Here is the garage sale example. Suppose we are looking for CDs for
    $10 or less in the Portland area. The mutant query plan includes
    regular query operators such as select and join, a display
    pseudo-operator that specifies the query plan's target, a constant
    piece of XML with the songs we are looking for, and two URNs. There
    are two steps in this process. In (a), a server resolves the ForSale
    URN to a union of two seller URLs, pushes the select operator
    through the union, and forwards the plan to one of the seller
    servers. In (b), that server substitutes its CD data for its URL,
    evaluates the select, and reduces its part of the plan to a constant
    piece of XML data. This series of URN-to-URL-to-data resolutions and
    sub-plan reductions continues until the whole plan is evaluated and
    forwarded to its target.

    If the plan is not yet fully evaluated, we must decide which server
    to send it to next. For example, if the blue circle can't be
    resolved, the MQP is forwarded to another server.

    Here, consulting the catalog, we can decide on the next server. To
    further illustrate what MQPs bring us, let's look at the comparison
    between the pipelined plan and the mutant plan. We compare the two
    prototypes on a simple query, using the XML-encoded plays of
    Shakespeare. The query asks for all the lines of Sir John Falstaff,
    in any play. Our setup has three identical servers, A, B, and C,
    with a fourth machine as the client. B stores all the comedies; C
    stores all the histories and tragedies.

    In the pipelined plan (figure (a)), the client submits the query to
    the coordinator, A, which unions two sub-plans running on B and C.
    The sub-plans scan their local plays, select Falstaff's lines, and
    stream them to A. In the MQP version (figure (b)), the query plan is
    again the union of two selects, but contains just URNs for the
    tragedies and comedies. The client sends the MQP to A, which routes
    it to B with no local evaluation. B resolves the comedies URN to a
    set of URLs, executes that part of the plan, appends its local
    Falstaff lines to the mutant plan, and sends it to C. C resolves the
    tragedies URN, executes the rest of the plan, and sends the final
    results to the client.

    The mutant plan has worse latency than the pipelined one, since it
    works on only one server at a time. Notice, though, that the mutant
    plan's total time is less than the sum of the B and C times of the
    pipelined plan: the mutant plan transfers fewer tuples, since
    Falstaff's lines from the tragedies are transferred once, between C
    and the client, instead of going from C to A to the client. The
    footprint of the pipelined plan is worse than the mutant plan's
    because A must wait for both apprentice sites to finish before it
    finishes.

    While performance on individual queries is somewhat worse, overall
    load on servers is reduced. Traditional distributed query processing
    requires distributing and activating subplans and communicating with
    them simultaneously; it requires distributed state. Our approach, in
    contrast, allows evaluation of distributed queries while maintaining
    only local state at any point (except for brief periods to transfer
    MQPs).

    We can route MQPs around unavailable data sources. A mutant plan
    will head for servers that can perform some work, leaving the
    unavailable servers for last. If those servers are still
    unavailable, it can either lie dormant or head to the client with
    the partial results it has gathered.

    It's often the case that data are stored, grouped, replicated and
    queried according to one or more categorization hierarchies that are
    natural for the application. Here I introduce multi-hierarchic
    namespaces, which use these natural categorization hierarchies to
    build distributed indices for query routing.

    Suppose we are looking for second-hand armchairs in the Portland area. Our interest area is then [USA/OR/Portland, Furniture/Chairs] and we only have to contact servers whose interest areas overlap with ours to find out about all pertinent items.

    A base server: a seller in our P2P garage sale example might have an
    interest area of [USA/OR/Portland, Music/CDs]. An index server for
    the P2P garage sale could index, for example, servers overlapping
    [USA/OR, *]; index servers can also maintain indices on data
    attributes not used for categorization, e.g., price. A category
    server: as with index and meta-index servers, category servers can
    cooperate with each other to manage their namespaces. Category
    servers can delegate portions of the namespace they manage to other
    category servers, much like the way DNS servers can delegate
    sub-domains to other servers.

    Meta-index servers can afford to cover much larger interest areas
    than index servers, because they only maintain multi-hierarchic
    namespace indices. Peers can maintain caches of index and meta-index
    servers they have used in the past. A peer that joins the P2P
    network for the first time will have to discover category servers,
    and also meta-index servers that serve top-level categories (for
    example, a meta-index server that covers [France, *]), and it
    obviously cannot use the P2P network for that. Peer software can
    either include hardwired locations of such servers or, preferably,
    discover them out-of-band, for example by doing a search on a web
    search engine.

    When we talk about registration of a new peer, we need the role of
    an authoritative server. Routing a plan through an authoritative
    index or meta-index server allows it to find the known base servers
    in a particular interest area.

    An index or meta-index server that wishes to become authoritative for an interest area must first find the most detailed authoritative server group. At that point, the server must register with the other servers in that group so that it can start receiving registrations and updates from servers within its interest area, and also start receiving queries.

    When a base server wants to join the P2P network, it needs to
    register with index or meta-index servers that intersect with its
    interest area, to make its data available to other peers. Ideally,
    the servers it registers with should include authoritative servers
    whose union covers its interest area. Suppose the URN we are trying
    to resolve has an interest area of [USA/OR/Portland, Music/CDs]. Our
    client may already know an authoritative meta-index server for
    [USA, *], so it sends the query plan there. This server may forward
    the query plan to a server for [USA, Music], which may then forward
    it to a server that knows about [USA/OR, Music], and so on, until we
    reach an index server that will replace the URN with a combination
    of URLs. (To avoid flooding high-level servers with plans, peers
    maintain caches of index and meta-index servers for interest areas,
    so that they can route plans more efficiently in the future.) There
    is no guarantee that we can find an authoritative server for every
    query. It may very well be that we cannot find any servers for some
    part of a query's interest area, or that, to get a complete answer,
    we may have to contact multiple servers that collectively cover an
    interest area.

    So, how do we decide which nodes route, index (and sometimes store
    or cache) which data? A distributed hash table (DHT) algorithm is
    one way. Government agencies, such as the NIH, could provide
    meta-index services and fund the development of controlled
    vocabularies and ontologies. Peers could buy or sell data objects
    and place bids to execute sub-queries. Our idea is the intelligent
    routing of query plans based on intentional statements about server
    coverage, completeness and redundancy, and a form of semantic query
    optimization. Meta-index servers map interest areas to collections
    of URLs at index or base servers (or possibly other meta-index
    servers).

    We can see that some servers may be wholly or partially redundant
    with others. We also hope each server can announce its policies for
    replicating or indexing information at other servers.

    How are such intentional statements used in the processing of MQPs?
    First of all, whenever a server registers an interest area with a
    meta-index server, it can also provide intentional statements that
    the meta-index server can retain. Servers can then use this
    information in binding and routing MQPs. The latency for query
    evaluation will likely be longer in the second case, because of the
    need to visit two sites rather than one.

    More likely, servers will periodically contact other servers to update content.

    Accumulating catalog and statistics information: as an MQP passes
    through a server, that server may have information about portions of
    the query it chooses not to evaluate, but that may be useful at
    later processing steps. A server can also improve its catalog
    information by examining a URN in the original query and its set of
    URLs in the partially evaluated query.

    Maintaining provenance: an MQP can also carry along a history of all
    the servers it has visited, as well as what each one did (provided
    bindings, provided data, re-optimized the MQP, evaluated a
    sub-expression, or merely forwarded the MQP), when it did it, and
    how current the information was. That provenance can then be used at
    the final destination or at intermediate servers for a variety of
    purposes:

    The benefits of knowing the processing history of a query:
    - Rewards system: if server S observes that many of the queries it
      is getting for its data are due to indexes maintained at server T,
      S might reward T in some way. For example, S might devote a larger
      percentage of its index space to T's data in return.
    - Meta-index updating: if server S is getting a lot of MQPs
      forwarded from server T that it just ends up forwarding to server
      R, S might send T a meta-index entry allowing it to route some of
      those queries directly to R. Or S might observe that T declines to
      bind source B even though T holds a copy of B; S might then decide
      to route MQPs needing B elsewhere in the future.
    - Detection of spoofing: to this point, we have been assuming that
      MQP servers behave correctly, and certainly not maliciously. But
      what if server S tried to tinker with queries to the detriment of
      a competitor's server T? For example, server S may get an MQP P
      with an expression σD(A) ∪ σD(B), where A has data records at S
      and B has records at T. S could bind A to its actual value, but
      bind B to the empty set, making it appear that T has no qualifying
      items. If provenance is recorded, the resulting MQP would show
      that P never visited T (or any other site for B). If S also spoofs
      the provenance, to make it appear T participated, then it is
      possible to construct a verification query (e.g., count(σD(B)))
      to send to T to check the result in P. To make the provenance more
      trustworthy, each addition to it could be digitally signed by the
      server that adds it and encrypted with the public key for the
      destination site. However, provenance is not a complete solution
      to a misbehaving server: in the example above, it is hard to
      detect if S is lying about A's contents.
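    A sketch of what a tamper-evident provenance trail could look like:
    each server appends a record chained by a digest of the previous
    entry. A real deployment would use per-server digital signatures and
    encryption as described above; the field names here are assumptions:

      import hashlib, json, time

      def append_provenance(trail: list, server: str, action: str) -> None:
          """Append a record chained to the previous entry's digest."""
          prev = trail[-1]["digest"] if trail else ""
          entry = {"server": server, "action": action,
                   "at": time.time(), "prev": prev}
          entry["digest"] = hashlib.sha256(
              json.dumps(entry, sort_keys=True).encode()).hexdigest()
          trail.append(entry)

      trail: list = []
      append_provenance(trail, "A", "provided bindings")
      append_provenance(trail, "B", "evaluated sub-expression")
      append_provenance(trail, "C", "forwarded")
      print([e["server"] + ": " + e["action"] for e in trail])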

    Issues: for example, a query submitter might not want his or her
    music preferences known to a track-list server. Or an intermediate
    server might not want its data exposed to a competitor's server down
    the line.
    Solution 1: do not bind preferences until the playlist is bound, or
    only let the MQP pass through servers on a given list.
    Solution 3: suppose a law enforcement agency wants to know which
    employees of a given company have made charitable contributions over
    5000 dollars to organizations that are believed to be fronts for
    illegal activities. The peer/server of the IRS has tax returns
    showing itemized deductions for contributions, and the peer/server
    of the State Department has a list of front organizations. But the
    IRS may balk at disclosing all contributions for all employees at a
    company, and the State Department may not want to reveal its list of
    suspect organizations. To solve this security problem, we can handle
    the mutant query plan as follows. First, at the IRS peer/server,
    find, for the specific company, the employee names and the charity
    names where the contribution is over 5000 dollars. Second, the
    results of the first step are bound into the MQP, which travels to
    the State Department peer/server. Last, the results from the IRS are
    joined with the front-organization list and projected onto person
    name; the fully evaluated MQP is then routed back to the law
    enforcement agency. In this way, neither the IRS nor the State
    Department has to disclose excessive sensitive information to the
    agency.

    Our main assumption is that data and query result distributions can be mapped naturally to multi-hierarchic namespaces, allowing us to build decentralized indices for efficient query routing. We believe that this assumption is reasonable, not just for our P2P garage sale example, but for a wide range of P2P data management applications.

