
A Survey of Uncertain Data Algorithms and Applications

Charu C. Aggarwal, Senior Member, IEEE, and Philip S. Yu, Fellow, IEEE

Abstract—In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such databases are much more complex because of the additional challenges of representing the probabilistic information. In this paper, we provide a survey of uncertain data mining and management applications. We will explore the various models utilized for uncertain data representation. In the field of uncertain data management, we will examine traditional database management methods such as join processing, query processing, selectivity estimation, OLAP queries, and indexing. In the field of uncertain data mining, we will examine traditional mining problems such as frequent pattern mining, outlier detection, classification, and clustering. We discuss different methodologies to process and mine uncertain data in a variety of forms.

Index Terms—Mining methods and algorithms, database applications, database management, information technology and systems.

1 INTRODUCTION

In recent years, many advanced technologies have been developed to store and record large quantities of data continuously. In many cases, the data may contain errors or may only be partially complete. For example, sensor networks typically create large amounts of uncertain data sets. In other cases, the data points may correspond to objects which are only vaguely specified, and are therefore considered uncertain in their representation. Similarly, surveys and imputation techniques create data which is uncertain in nature. This has created a need for uncertain data management algorithms and applications [2].

In uncertain data management, data records are typically represented by probability distributions rather than deterministic values. Some examples in which uncertain data management techniques are relevant are as follows:

- The uncertainty may be a result of the limitations of the underlying equipment. For example, the output of sensor networks is uncertain because of the noise in sensor inputs or errors in wireless transmission.

- In many cases such as demographic data sets, only partially aggregated data sets are available because of privacy concerns. Thus, each aggregated record can be represented by a probability distribution. In other privacy-preserving data mining applications, the data is perturbed in order to preserve the sensitivity of attribute values. In some cases, probability density functions of the records may be available. Some recent techniques [8] construct privacy models, such that the output of the transformation approach is friendly to the use of uncertain data mining and management techniques.

- In some cases, data attributes are constructed using statistical methods such as forecasting or imputation. In such cases, the underlying uncertainty in the derived data can be estimated accurately from the underlying methodology. An example is that of missing data [56].

- In many mobile applications, the trajectory of the objects may be unknown. In fact, many spatiotemporal applications are inherently uncertain, since the future behavior of the data can be predicted only approximately. The further into the future that the trajectories are extrapolated, the greater the uncertainty.

The field of uncertain data management poses a number of unique challenges on several fronts. The two broad issues are those of modeling the uncertain data, and then leveraging it to work with a variety of applications. A number of issues and working models for uncertain data have been discussed in [2] and [34]. The second issue is that of adapting data management and mining applications to work with the uncertain data. The main areas of research in the field are as follows:

- Modeling of uncertain data. A key issue is the process of modeling the uncertain data, so that the underlying complexities can be captured while keeping the data useful for database management applications.

- Uncertain data management. In this case, one wishes to adapt traditional database management techniques for uncertain data. Examples of such techniques could be join processing, query processing, indexing, or database integration.

- Uncertain data mining. The results of data mining applications are affected by the underlying uncertainty in the data. Therefore, it is critical to design data mining techniques that can take such uncertainty into account during the computations.

In the next sections, we will discuss these different aspects of uncertain data representation, management, and mining. We will discuss the different issues with uncertain




data representation, and their corresponding effect on database applications. We will also survey a broad variety of database management and mining applications.

This paper is organized as follows: In Section 2, we will examine the issue of uncertain data representation and modeling. In Section 3, we will examine a number of data management algorithms for uncertain data. We specifically examine the problems of query processing, indexing, selectivity estimation, OLAP, and join processing. A number of mining algorithms for uncertain data are discussed in Section 4. We examine the clustering and classification problem as well as a general approach to mining uncertain data. Section 5 contains the conclusions and summary.

2 UNCERTAIN DATA REPRESENTATION AND MODELING

The problem of modeling uncertain data has been studied extensively in the literature [1], [46], [45], [49], [72]. A database that provides incomplete information consists of a set of possible instances of the database. It is important to distinguish between incomplete databases and probabilistic data, since the latter is a more specific definition which creates database models with crisp probabilistic quantification.

2.1 Probabilistic Database Definitions

A probabilistic database is defined [45] as follows:

Definition 2.1. A probabilistic-information database is a finite probability space whose outcomes are all possible database instances consistent with a given schema. This can be represented as the pair $(X, p)$, where $X$ is a finite set of possible database instances consistent with a given schema, and $p(I)$ is the probability associated with any instance $I \in X$. We note that since $p$ represents the probability vector over all instances in $X$, we have $\sum_{I \in X} p(I) = 1$.

We note that the above representation is a formalism of the possible worlds model [1]. The direct specification of such databases is unrealistic, since an exponential number of instances would be needed to represent the table. Therefore, the natural solution is to use a variety of simplified models which can be easily used for data mining and data management purposes. We will discuss more on this issue slightly later.

Probabilistic ?-tables [41], [54] are a simple way of representing probabilistic data. In this case, one models the probability that a particular tuple is present in the database. Thus, the probability of a particular instantiation of the database can be defined as the product of the probabilities of the corresponding set of tuples to be present in the database with the product of the probabilities of the complementary set of tuples to be absent from the database.
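As a concrete illustration of this product form, the following minimal Python sketch (not from the paper; the tuple names and presence probabilities are invented) enumerates all instances of a small ?-table under tuple independence and checks that the instance probabilities sum to one:

```python
from itertools import product

# Hypothetical ?-table: each tuple appears in the database with an
# independent presence probability.
tuples = {"t1": 0.2, "t2": 0.7, "t3": 0.5}

def world_probability(present, table):
    """Probability of one instance: product of presence probabilities of the
    tuples that are in, times absence probabilities of the tuples that are out."""
    prob = 1.0
    for name, p in table.items():
        prob *= p if name in present else (1.0 - p)
    return prob

# Enumerate all 2^k possible worlds.
names = list(tuples)
worlds = []
for mask in product([False, True], repeat=len(names)):
    present = {n for n, keep in zip(names, mask) if keep}
    worlds.append((present, world_probability(present, tuples)))

for present, prob in worlds:
    print(sorted(present), round(prob, 4))
print("total probability:", round(sum(p for _, p in worlds), 10))  # -> 1.0
```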

A closely related probabilistic representation is that of probabilistic or-set tables. While the probabilistic ?-table is concerned with the presence or absence of a particular tuple, the p-or-set table is concerned with modeling the probabilistic behavior of each attribute for a tuple that is known to be present in the database. In this case, each attribute is represented as an or over various possibilities along with corresponding probability values. An instantiation of the database is constructed by picking each outcome for an attribute independently. The ProbView model presented in [54] is a kind of or-set table. The only significant difference is that the ProbView model uses confidence values instead of probabilities. Other interesting representations of probabilistic databases may be found in [45]. A number of interesting properties of such databases are also discussed in [41], [54], and [80].

2.2 Simplifying Assumptions in Practical Applications

The above definitions are fairly general formalisms for probabilistic data. In many practical applications, one may often work with simplifying assumptions on the underlying database. One such simplifying assumption is that the presence and absence of different tuples is probabilistically independent. In such a formalism, all possible probability distributions on possible worlds are not captured with the use of independent tuples. This is referred to as incompleteness. Furthermore, one needs to be careful in the application of such a formalism, since it may result in inconsistency. For example, in an uncertain spatiotemporal database with tuples representing object locations at different times, the locations of the objects need to be consistent. In any particular instantiation of the database, it is important that the same object not be in multiple localities at the same time. Therefore, if the database is represented in terms of positional tuples, one needs to check for consistency of tuples within a given temporal locality. Such restrictions are often represented by rules that enforce the relationships between the behavior of the different tuples.

Most data mining or query processing applications work with further simplifications. For example, attribute-uncertainty models represent attributes by a discrete or continuous probability distribution, depending upon the data domain. Some applications [9] on continuous data may even work with statistical parameters such as the underlying variance of the corresponding attribute value. Thus, from an application point of view, the uncertainty may be represented in different ways, and it may not always conform to a database-centric view. Furthermore, uncertain databases are often designed with specific application goals in mind. Our discussions above can be summarized in terms of the major classes of uncertainty that most applications work with. Broadly, most applications work on two kinds of uncertainty: 1) Existential uncertainty: In this case, a tuple may or may not exist in the database, and the presence and absence of one tuple may affect the probability of the presence or absence of another tuple in the database. In some cases, the tuple independence assumption is used, according to which the probabilities of presence of the different tuples are independent of one another. Furthermore, there may be constraints that correspond to mutual exclusivity of certain tuples in the database. 2) Attribute level uncertainty: In this case, a number of tuples and their modeling have already been determined. The uncertainties of the individual attributes are modeled by a probability density function, or other statistical parameters such as the variance.

2.3 Recent Projects

A number of recent projects have designed uncertain databases around specific application requirements. For example, the Conquer project [3], [42] introduced query rewriting algorithms to extract clean and consistent answers


from unclean data under possible worlds semantics. Methods are also proposed to derive probabilities of uncertain items. One of the key aspects of the Conquer project is that it permits real time and dynamic data cleaning in such a way that clean and consistent answers may be obtained for queries. Another example of such a database is the Orion project [25], [28], which presents query processing and indexing techniques in order to manage uncertainty over continuous intervals. Such application-specific databases are designed for their corresponding domain, and are not very effective in extracting information from possible worlds semantics.

A recent and interesting line of models for uncertain data is derived from the Trio project [16], [62], [34] at Stanford University. This work introduces the concept of the Uncertainty-Lineage Database (ULDB), which is a database with both uncertainty and lineage. We note that the introduction of lineage as a first-class concept within the database is a novel concept which is useful in a variety of applications such as query processing. The basic idea in lineage is that the model keeps track of the sources from which the data was acquired and also keeps track of its influence in the database. Thus, a database with lineage can link the query results (or the results from any potential application) to the source from which they were derived. The probabilistic influence of the data source on the final result is an important factor which should be accounted for in data management applications. Thus, data (or results) which are found to be unreliable are discarded.

Finally, a recent effort is the MayBMS project [4], [5], [6] at Cornell University. One advantage of this system is that it fits seamlessly into modern database systems. For example, this approach has a powerful query language which was built on top of PostgreSQL. Another unique feature of the system is that it uses the concept of U-relations in order to maximize space-efficiency. Space-efficiency is a critical feature in uncertain database systems, since the uncertainty results in considerable expansion of the underlying database representation. Details of the most recent approach may be found in [6].

2.4 Extensions to Semistructured and XML Data

Recently, uncertain data models have also been extended to semistructured and XML data. Some of the earliest work on probabilistic semistructured data may be found in [66]. XML data poses numerous unique challenges. Since XML is structured, the probabilities need to be assigned to the structural components such as nodes and links. Furthermore, element probabilities could occur at multiple levels, and nested probabilities within a subtree must be considered. Furthermore, incomplete data should be handled gracefully, since one may not insist on having complete probability distributions. In order to handle the issue that there can be nesting of XML elements, probabilities are associated with the attribute values of elements in an indirect way. The approach is to modify the schema in XML so as to make any attribute into a subelement. Thus, these new elements can be handled by the probabilistic system. Another unique issue in the case of XML data is that the probabilities in an ancestor-descendent chain are related probabilistically.

In the most general case, this can lead to issues of computational intractability. The approach in [66] is to model some classes of dependence (e.g., mutual exclusion) which are useful and efficient to model. The work in [66] also designs techniques for a restricted class of queries on the data. Another interesting approach to probabilistic XML data construction has been discussed in [50]. In this technique, probabilistic XML trees are constructed in order to model the structural behavior of the data. The uncertainty in a probabilistic tree is modeled by introducing two kinds of nodes: 1) probability nodes, which enumerate all possibilities, and 2) possibility nodes, which have an associated probability. The uncertainty in the different kinds of nodes is modeled with the use of the kind function, which assigns node kinds. Furthermore, a prob function is used, which assigns probabilities to nodes. The query evaluation technique enumerates all possible worlds in a recursive manner. The query is then applied to each such enumerated world. Other related work on XML data representation and modeling may be found in [79].

3 UNCERTAIN DATA MANAGEMENT APPLICATIONS

In this section, we will discuss the design of a number of data management applications with uncertain data. These include applications such as query processing, Online Analytical Processing, selectivity estimation, indexing, and join processing. We will provide an overview of the application models and algorithms in this section.

3.1 Query Processing of Uncertain Data

In traditional database management, queries are typically represented as SQL expressions which are then executed on the database according to a query plan. As we will see, the incorporation of probabilistic information has considerable effects on the correctness and computability of the query plan.

3.1.1 Intensional and Extensional Semantics

A given query over an uncertain database may require computation or aggregation over a large number of possibilities. In some cases, the query may be nested, which greatly increases the complexity of the computation. There are two broad semantic approaches used:

- Intensional semantics. This typically models the uncertain database in terms of an event model (which defines the possible worlds), and uses tree-like structures of inferences on these event combinations. This tree-like structure enumerates all the possibilities over which the query may be evaluated and subsequently aggregated. The tree-like enumeration results in an exponential complexity in evaluation time, but always yields correct results.

- Extensional semantics. Extensional semantics attempts to design a plan which can approximate these queries without having to enumerate the entire tree of inferences. This approach treats uncertainty as a generalized truth value attached to formulas, and attempts to evaluate (or approximate) the uncertainty of a given formula based on that of its subformulas.

For the intensional case, the key is to develop a probabilistic relational algebra with intensional semantics which always yields correct results. It has been shown in [32] that certain queries have #P-complete data complexity under intensional semantics. Note that the extensional semantics


approach is mostly useful for simple expressions. When the relations are more complicated or nested, there may be dependencies in the underlying query results, which cannot be evaluated easily. Since intensional semantics uses a comprehensive enumeration-based approach, it always yields correct results, whereas extensional semantics provides an efficient heuristic, which is useful only when it yields correct or approximately correct results. In order to understand this point, consider a possible worlds model of a database drawn from $k$ possible tuples $s_1, s_2, \ldots, s_k$, each of which has a probability of presence in the database equal to 0.2. The aim is to compute the probability that both $s_1$ and $s_2$ are present in the database. An intensional plan would therefore require us to explicitly create the event variables $e(s_1)$, $e(s_2)$ for $s_1$ and $s_2$ and compute the probability $P(e(s_1) \wedge e(s_2))$. Note that each of the variables $e(s_1)$ and $e(s_2)$ will depend upon how the underlying database is modeled in terms of events, and will evaluate into a tree-like structure of inferences over possible worlds of events. An accurate and efficient extensional plan may not be possible in this case. On the other hand, if the correlations among different tuples are extremely weak or absent, then an efficient extensional plan would simply compute this probability as $P(s_1) \cdot P(s_2) = 0.04$.

Clearly, a general method is required to reduce the query evaluation complexity in relational databases. One of the earliest techniques for adding probabilistic information into query evaluation was discussed in [41]. This model is a generalization of the standard relational model. In this model, probabilistic relations are treated as generalizations of deterministic relations. Thus, even though deterministic models allow binary tuple weights, probabilistic relations allow tuple weights which can vary between 0 and 1. The basic operators of relational algebra are modified in order to take the weights into account during query processing. Thus, while applying an operator of the relational algebra, the weights of the result tuples are computed as a function of the tuple weights in the argument relation. A more recent technique proposed in [32] designs a technique in which a correct extensional plan is available. We note that since the problem is #P-complete, a correct extensional plan is not always available. However, many queries which occur in practice do admit a correct extensional plan. According to [32], 8 out of 10 queries of the TPC-H benchmark (a standard suite of decision support queries developed for benchmarking; more details can be found at http://www.tpc.org/tpch) fall into this category. For queries which do not admit a correct extensional plan, two techniques are proposed to construct results which yield approximately correct answers. A fast heuristic is designed which can avoid large errors, and a sampling-based Monte-Carlo algorithm is designed which is more expensive, but can guarantee arbitrarily small errors. In addition, the technique in [32] also extends the solution to the case of uncertain predicates on deterministic data. A different imprecision model is discussed in [33], in which only the data statistics and explicit probabilities at the data sources are used. It is shown in [33] that such imprecisions can be modeled by a certain kind of probabilistic database with complex tuple correlations. The method in [32] is then used in order to rewrite the queries for effective query resolution. We note that the work in [32] assumes tuple independence, which is often not the case for a probabilistic database. In the event that possible worlds semantics are used, the algorithms for query processing become much more difficult, since one needs to maintain consistency over the query answers. This problem is also related to that of determining consistent query answers in inconsistent databases [12].

3.1.2 Queries with Correlations

While the work in [32] assumes tuple independence, this may not always be the case in many practical applications. For example, data from sensors [35] may be highly correlated both in terms of space and time. Furthermore, even if it is assumed that the tuples are independent, many intermediate results of queries may contain complex correlations. For example, even the simple join-operator is not closed under tuple independence. In [73], a technique has been proposed on querying correlated tuples with the use of statistical modeling techniques. The method in [73] constructs a uniform framework which expresses uncertainties and dependencies through the use of joint probability distributions. The query evaluation problem on probabilistic databases is cast as an inference problem in probabilistic graphical models [40]. Probabilistic graphical models form a powerful class of approaches which can compactly represent and reason about complex dependency patterns involving large numbers of correlated random variables. The main idea in the use of this approach is the use of factored representations for modeling the correlations. A variety of algorithms may then be used on the probabilistic graphical model, and the exact choice of algorithm depends upon the requirements for accuracy and speed.
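As a toy illustration of why correlations matter (this is not the framework of [73]; the factor values are invented), the sketch below represents two correlated tuple-presence variables by a single joint factor and computes the probability that both tuples appear, which an independence-based plan would estimate incorrectly:

```python
# Joint factor over two correlated tuple-presence variables t1, t2.
# Entries are P(t1=a, t2=b); they sum to 1 but do NOT factorize.
joint = {
    (0, 0): 0.50,
    (0, 1): 0.10,
    (1, 0): 0.10,
    (1, 1): 0.30,
}

p_both = joint[(1, 1)]
p_t1 = sum(p for (a, _), p in joint.items() if a == 1)
p_t2 = sum(p for (_, b), p in joint.items() if b == 1)

print("exact P(t1 and t2):", p_both)            # 0.30
print("independence estimate:", p_t1 * p_t2)    # 0.16 -- misleading here
```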

3.1.3 Top-k Query

A related query is the top-k query, in which the aim is to find the top-k answers for a particular query. The top-k ranking is based on some scoring function in deterministic applications. However, in uncertain applications, such a clean definition does not exist, since the process of reporting a tuple in a top-k answer does not depend only on its score but also on its membership probability. A further challenge is to use possible worlds semantics which can allow complex correlations among tuples in the database. In order to deal with the issue of possible worlds semantics, the technique in [74] uses generation rules, which are logical formulas that determine valid worlds. The interplay of the possible worlds semantics with top-k queries requires the careful redefinition of the semantics for the query itself. For example, consider the case of a radar-controlled traffic system [74], in which the radar readings may be in error because of multiple sources of uncertainty such as interference from high-voltage lines and human identification mistakes. Some examples of deterministic top-k queries are as follows:

- Determine the top-k speeding cars in the last hour.

- Determine a ranking over the models of the top-k speeding cars.

While these queries are clear in the deterministic case, they need to be reformulated for the case of uncertain and imprecise data. For example, all responses to the queries need to be defined in valid possible worlds in order to avoid answers inconsistent with generation rules and other




database constraints. Furthermore, in the second query, one may wish to evaluate the top-k query in a most probable world. The interaction between the most probable and the top-k results in different possible interpretations of uncertain top-k queries:

- The top-k tuples in the most probable world.

- The most probable top-k tuples that belong to valid possible worlds.

- The set of most probable top-ith tuples across all possible worlds, where $i = 1, \ldots, k$.

We note that the interpretations of the queries above involve both ranking and aggregations across possible worlds. The work in [74] models uncertain top-k queries as a state-space search problem, and introduces several space navigation algorithms with optimality guarantees on the number of accessed tuples in order to find the most probable top-k answers. In order to model the state-space probabilities, a Rule Engine is used, which is responsible for computing the state-space probabilities. This Rule Engine can be modeled in the form of a Bayesian Network [40]. The work in [74] also creates a framework for integrating space navigation algorithms and data access methods for leveraging existing DBMS technologies. One key result presented in [74] is that among all sequential access methods, the retrieval of tuples in the order of their scores leads to the least possible number of accessed tuples to answer uncertain top-k queries.

3.1.4 The OLAP Model

One interesting data model for query processing is that of the OLAP model. The queries which are most relevant to the OLAP setting are aggregation queries, in which one attempts to aggregate a particular function of the data on a part of the data cube. The earliest work in the aggregation setting was discussed in [23], [59], [70], and [71]. Much of this work does not relate directly to OLAP queries in the sense that while they provide aggregation functions, they do not use the domain hierarchies which are inherent in the OLAP environment. The earliest work in the OLAP setting was discussed in [64], which considers the semantics of aggregate queries in an uncertain environment. However, this technique does not consider the implications of an OLAP setting which uses domain hierarchies in order to define the data.

In [20], a crisp set of criteria has been identified in order to handle ambiguity. The criteria which are identified in [20] are as follows:

- Consistency. This criterion discusses the concept of consistency from the OLAP perspective. This accounts for the relationship between similar queries which are issued at related nodes in a domain hierarchy, in order to meet users' intuitive expectations as they navigate up and down the hierarchy. For example, for the case of a SUM query, the SUM for a query region should be equal to the value obtained by adding the results of SUM for the query subregions that partition the region.

- Faithfulness. This captures the notion that more precise data should lead to more accurate results. For example, for a SUM query over nonnegative measures, as the imprecision in the data increases and grows outside the query region, it is expected that the result of the SUM query should be nonincreasing.

- Correlation-preservation. This requires that the correlation properties of the data should not be affected by the allocation of ambiguous data records. For example, the computation of the SUM under a tuple-specific constraint will be affected by the correlations among different tuples.

In order to model the uncertainty, the work in [20] relaxes the restriction that the dimension attributes must be assigned leaf-level values from the domain hierarchy. For example, we can denote that a repair took place in Texas without specifying a city explicitly. This has implications for how queries are answered: if a query aggregates repair costs in Austin, should the example repair be included, and how? The second extension is to introduce a new measure attribute which represents uncertainty. This is in the form of a probability distribution function over the base domain. Two broad approaches are proposed in [20] in order to deal with these different kinds of uncertainty:

- Query allocation. In this case, data which is assigned to higher levels of the hierarchy needs to be allocated to lower level leaf nodes by partial assignment. This partial assignment is captured by the weights on the assignment to nodes at different levels. For response consistency, it is reasonable to expect that this assignment should be query independent.

- Aggregating uncertain measures. In this case, the query needs to aggregate over different probability density functions. The problem of aggregating pdfs is closely related to a problem studied in the statistics literature, which is that of opinion pooling [44]. The opinion-pooling problem is to form a consensus opinion from a given set of opinions. The set of opinions as well as the consensus opinion are represented as pdfs over a discrete domain O.
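One common operator in the opinion-pooling literature is simple linear pooling; the sketch below is a hedged illustration with invented weights and opinions, and is not necessarily the operator adopted in [20] or [44]:

```python
# Linear opinion pooling: the consensus pdf is a weighted average of the
# individual opinion pdfs over a discrete domain O.
domain = ["low", "medium", "high"]
opinions = [
    {"low": 0.7, "medium": 0.2, "high": 0.1},
    {"low": 0.2, "medium": 0.5, "high": 0.3},
]
weights = [0.5, 0.5]  # must sum to 1

consensus = {
    o: sum(w * pdf[o] for w, pdf in zip(weights, opinions))
    for o in domain
}
print(consensus)                 # {'low': 0.45, 'medium': 0.35, 'high': 0.2}
print(sum(consensus.values()))   # 1.0 -- still a valid pdf
```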

The work in [20] also allows a possible worlds interpretation of a database D containing imprecise facts, as a prelude to defining query semantics. If an imprecise fact r maps onto a region R of cells, then each cell in R represents a possible completion of r that eliminates the imprecision in r. More details may be found in [20]. Since imprecise data can often contain domain constraints in order to avoid inconsistency, a key issue is the extension of this model to the constrained case. In [21], the regularities in the constraint space are captured with the use of a constraint hypergraph in order to provide efficient answers to such queries.

3.2 Indexing Uncertain Data

The problem of indexing uncertain data arises frequently in the context of several application domains such as moving trajectories or sensor data. In such cases, the data is updated only periodically in the index, and therefore the current attribute values cannot be known exactly; they can only be estimated. There are many different kinds of queries which can be resolved with the use of index structures:

- Range queries. In range queries, the aim is to find all the objects in a given range. Since the objects are uncertain, their exact positions cannot be known, and hence their membership in the range also cannot


be known deterministically. Therefore, a probability value is associated with each object's membership in the range. All objects whose probability of membership lies above a certain threshold are retained.

- Nearest neighbor queries. In nearest neighbor queries, we attempt to determine the objects with the least expected nearest neighbor distance to the target. An alternative way of formulating the probabilistic nearest neighbor query is in terms of the nonzero probability that a given object is the nearest neighbor to the target.

- Aggregate queries. In such queries, the aim is to determine aggregate statistics from queries such as the sum or the max. Aggregate queries are inherently more difficult than other kinds of queries such as range or nearest neighbor queries, because one has to account for the interplay of different objects.

In [25], a broad classification of the queries has been provided in the context of index structures. Queries can often be classified depending upon the nature of the answers. An entity-based query returns a set of objects that satisfy the condition of the query. A value-based query returns a single value, examples of which include the querying of the value of a particular dimension, or computing some statistical function of a set of objects satisfying query constraints (e.g., average, max). Another property which can be used to classify queries is whether or not aggregation is involved. In [25], broad classes of query processing techniques have been discussed for each of these different kinds of queries.

3.2.1 Moving Object Environments

An important domain for indexing and querying imprecise data is that of moving object environments [28]. In such environments, it is infeasible for the database tracking the movement of the objects to store the exact locations of the objects at all times. The location of an object is known with certainty only at the time of the update. Between two updates, the uncertainty of the location increases till the next update. The error in answers to queries can be controlled by limiting the level of uncertainty.

Several specific models of uncertainty are possible for the case of moving objects. One popular model for uncertainty is that, at any point in time, the moving object is within a certain distance d of its last reported position. If the object moves further than this distance, it reports its new location, and relocates its anchor point to the new reported position. Other models for uncertainty may assume specific patterns of movement such as that in a straight line. In such cases, the objects are assumed to lie in an interval along a straight line. In the case of [28], the uncertainty of a moving point is characterized in a fairly general way.

Definition 3.1. An uncertainty region $U_i(t)$ of an object $O_i$ at time $t$ is a closed region such that $O_i$ can be found only in this region.

Definition 3.2. The uncertainty density function $f_i(x, y, t)$ is the probability density function of the object $O_i$ at location $(x, y)$ and time $t$. This uncertainty function has a value of 0 outside $U_i(t)$.

We note that this is a fairly general model of uncertainty in that it does not assume any specific behavior of the object inside $U_i(t)$.

Aside from the standard range query, the work in [28] also tackles the probabilistic nearest neighbor query. In the probabilistic nearest neighbor query, the aim is to determine probabilistic candidates for the nearest neighbor of a given target along with corresponding probability values.

The process of responding to a probabilistic range query is fairly straightforward. In this case, the probability density function is integrated over the entire range of the query. All objects for which this probability value lies above a certain threshold are reported.
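As an illustration, the hedged sketch below assumes Gaussian positional uncertainty per object in one dimension (the objects and parameters are invented, not from [28]); it integrates each object's pdf over the query range and keeps objects above the threshold:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical objects with Gaussian positional uncertainty (1D).
objects = {"o1": (5.0, 1.0), "o2": (9.0, 2.0), "o3": (2.0, 0.5)}

def probabilistic_range_query(objs, lo, hi, threshold):
    """Return objects whose probability of falling in [lo, hi] exceeds threshold."""
    result = {}
    for name, (mu, sigma) in objs.items():
        p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        if p >= threshold:
            result[name] = p
    return result

print(probabilistic_range_query(objects, lo=4.0, hi=7.0, threshold=0.5))
# o1 qualifies with probability ~0.819; o2 and o3 are filtered out.
```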

The technique for processing a probabilistic nearest neighbor query involves evaluating the probability of each object being closest to the query point. One of the key challenges of the nearest neighbor query is that unlike the probabilistic range query, one cannot determine the probability for an object independent of the other points. The solution basically comprises the steps of projection, pruning, bounding, and evaluation. These steps are summarized as follows:

- Projection. In this phase, the uncertainty region of each moving object is computed based on the uncertainty model used by the application. The shapes of the uncertainty regions are determined by the uncertainty model used, the last recorded position of the object $O_i$, the time elapsed since the last update, and the maximum speeds of the objects.

- Pruning phase. This allows us to explicitly prune some of the objects without having to go through the expensive process of computing nearest neighbor probabilities. For example, if the shortest distance of the target to one uncertain region is greater than the corresponding longest distance of the target to another region, then it is possible to prune the former. Therefore, the key to the algorithm is to find f, the minimum of the longest distances of the uncertainty regions from the target q. Then, any object for which the shortest distance to the target q is larger than f is eliminated.

- Bounding phase. The pruning can be extended to portions of uncertainty regions which cannot be completely pruned. For each element, there is no need to examine all portions of the uncertainty region. It is necessary to only look at the regions that are located no farther than f from the target point q. This can be conceptually achieved by drawing a bounding circle C of radius f centered at q. Any portion of the uncertainty region outside C can be ignored.

- Evaluation phase. In this phase, one calculates, for each object, the probability that it is indeed the nearest neighbor to the target q. The solution is based on the fact that the probability of an object o being the nearest neighbor with distance r to the target q is given by the probability of o being at a distance r from q times the probability that every other object is at a distance r or larger from q. This value can then be integrated over different values of r.
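The evaluation step can be illustrated with the following hedged numerical sketch (1D objects with uniform uncertainty intervals and invented values; the actual method in [28] handles general two-dimensional regions). It integrates, over r, the density of the target object being at distance r from q times the probability that every other object is at distance at least r:

```python
# Probabilistic nearest-neighbor evaluation, 1D toy version.
# Each object is uniform over [lo, hi]; q is the query point.
objects = {"o1": (0.0, 2.0), "o2": (1.0, 4.0), "o3": (3.0, 5.0)}
q = 1.5

def prob_dist_geq(interval, q, r):
    """P(|X - q| >= r) for X uniform over interval."""
    lo, hi = interval
    length = hi - lo
    inside = max(0.0, min(hi, q + r) - max(lo, q - r))  # overlap with (q-r, q+r)
    return 1.0 - inside / length

def nn_probability(target, objs, q, steps=20000):
    """Numerically integrate over r: density of the target at distance r from q
    times the probability that all other objects lie at distance >= r."""
    lo, hi = objs[target]
    r_max = max(abs(lo - q), abs(hi - q))
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        # density of |X - q| = r for X uniform on [lo, hi]
        density = sum(1.0 / (hi - lo) for x in (q - r, q + r) if lo <= x <= hi)
        others = 1.0
        for name, interval in objs.items():
            if name != target:
                others *= prob_dist_geq(interval, q, r)
        total += density * others * dr
    return total

probs = {name: nn_probability(name, objects, q) for name in objects}
print(probs)                  # o1 is the most likely nearest neighbor
print(sum(probs.values()))    # ~1.0 (sanity check)
```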


3.2.2 Probabilistic Threshold Queries

A related work in [24] proposes the concept of probabilistic threshold queries. In such queries, the aim is to determine all objects whose behavior satisfies certain conditions with a minimum probability. The formal definition is as follows:

Definition 3.3. Given a closed interval $[c, d]$, where $c, d \in \mathbb{R}$ and $c \le d$, a probabilistic threshold query returns a set of tuples $T_i$, such that the probability $p_i$ that $T_i.a$ is inside $[c, d]$ is greater than or equal to $p$, where $0 \le p \le 1$. We note that $T_i.a$ represents the probability attribute of tuple $T_i$.

Thus, a probabilistic threshold query can be treated as a range query, which operates on probabilistic uncertainty information, and returns items whose probabilities of satisfying the query exceed p.

A number of index structures have been proposed in [24] in order to resolve this query. A naive way of evaluating the query is to first find all the tuples whose uncertainty intervals have some overlap with the corresponding range. Once these tuples have been determined, the corresponding probability of intersection can be determined in a straightforward way. In order to find all the tuples which intersect over a given range, it is necessary to build an index structure over different intervals, and apply a range search over the index for the prespecified interval. This can unfortunately be quite inefficient. The second problem is that the probability of each element in the data needs to be evaluated. If many items overlap with the specified interval, but only a few have probability of inclusion greater than p, then this can be quite inefficient.

A different solution proposed in [24] is referred to as Probability Threshold Indexing. This index structure is essentially based on a modification of a 1D R-Tree, where probability information is augmented to its internal nodes in order to facilitate pruning. In a traditional R-Tree, a range query is resolved by examining only those nodes of the tree which intersect with the user-specified range. This idea can be generalized by constructing tighter bounds (called x-bounds) than the Minimum Bounding Rectangle (MBR) of each node. Let $M_j$ denote the MBR/uncertainty interval represented by the jth node of an R-Tree, ordered by preorder traversal. Then, the x-bound of $M_j$ is defined as follows:

Definition 3.4. An x-bound of an MBR/uncertainty interval $M_j$ is a pair of lines, namely the left-x-bound (denoted by $M_j.lb_x$) and the right-x-bound (denoted by $M_j.rb_x$). Every uncertain object contained in this MBR is guaranteed to have a probability of at most $x$ (where $0 \le x \le 1$) of being left of the left-x-bound, and is also guaranteed to have a probability of at most $x$ of being right of the right-x-bound.

We note that this kind of bound is a generalization of the concept of the MBR. This is because the MBR of an internal node can be viewed as a 0-bound, since it guarantees that all intervals in the node are contained in it with probability 1.

The purpose of storing the information of the x-bound of a node is to avoid investigating the contents of a node. This saves I/O costs during index exploration. The presence of the x-bound allows us to decide whether an internal node contains any qualifying MBRs without further probing into the subtrees of this node. Let p be the threshold probability for the query. The two necessary pruning conditions (both conditions must hold) for node $M_j$ to be pruned with the use of the x-bound are as follows:

- $M_j$ can be pruned if $[a, b]$ does not intersect the left-x-bound or the right-x-bound of $M_j$, i.e., either $b < M_j.lb_x$ or $a > M_j.rb_x$.

- $p \ge x$.

In the event that the above conditions do not hold, the internal contents of node $M_j$ are examined and further exploration of the tree is resumed. It has been shown in [24] that the probability threshold query (PTQ) index is quite efficient when the threshold p is fixed a priori across all queries. When the threshold p varies, the index continues to be experimentally efficient on the average, though the actual behavior may vary quite a bit across different queries.
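The pruning test itself is simple to express; the hedged sketch below (invented node bounds, not the actual PTI implementation of [24]) checks whether an internal node can be skipped for a query interval [a, b] and threshold p, given one stored x-bound:

```python
def can_prune(node, a, b, p):
    """Return True if the node can be skipped for query interval [a, b]
    with probability threshold p, using one stored x-bound.
    node = {"x": x, "lb_x": left-x-bound, "rb_x": right-x-bound}."""
    x, lb_x, rb_x = node["x"], node["lb_x"], node["rb_x"]
    # Condition 1: [a, b] lies entirely left of the left-x-bound or
    # entirely right of the right-x-bound.
    outside = (b < lb_x) or (a > rb_x)
    # Condition 2: the query threshold is at least x.
    return outside and (p >= x)

# Hypothetical internal node carrying a 0.2-bound.
node = {"x": 0.2, "lb_x": 10.0, "rb_x": 30.0}

print(can_prune(node, a=2.0, b=8.0, p=0.5))    # True: every object has prob <= 0.2 left of 10
print(can_prune(node, a=2.0, b=8.0, p=0.1))    # False: threshold below x, cannot conclude
print(can_prune(node, a=12.0, b=18.0, p=0.5))  # False: query overlaps the x-bounds
```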

3.2.3 Uncertain Categorical Data

A method for indexing uncertain categorical data has been discussed in [76]. The definition used in [76] for the categorical data domain is as follows:

Definition 3.5. Given a discrete categorical domain $D = \{d_1, \ldots, d_N\}$, an uncertain discrete attribute (UDA) $u$ is a probability distribution over $D$. It can be represented by the probability vector $u.P = \{p_1, \ldots, p_N\}$ such that $Pr(u = d_i) = u.p_i$.

The probability that two uncertain attribute values are equal can be computed by calculating the corresponding equality probability over all possible uncertain values. Therefore, we have the following.

Observation 3.1. Given two UDAs $u$ and $v$, the probability that they are equal is given by $Pr(u = v) = \sum_{i=1}^{N} u.p_i \cdot v.p_i$.

Analogous to the notion of equality of value is distributional similarity. The distance function may be defined in terms of the L1 function, the L2 function, or the Kullback-Leibler distance function. The kinds of queries resolved by the technique in [76] are as follows:

- Probabilistic equality query (PEQ). Given a UDA $q$ and a relation R with a UDA $a$, the query returns all tuples $t$ from R along with probability values, such that the probability value $Pr(q = t.a) > 0$.

- Probabilistic equality threshold query (PETQ). Given a UDA $q$, a relation R with UDA $a$, and a threshold $\tau$, $\tau \ge 0$, the answer to the query is all tuples $t$ from R such that $Pr(q = t.a) \ge \tau$.

- Distributional similarity threshold query (DSTQ). Given a UDA $q$, a relation R with UDA $a$, a threshold $d$, and a divergence function F, DSTQ returns all tuples $t$ from R such that $F(q, t.a) \le d$.

- Probabilistic equality threshold join (PETJ). Given two uncertain relations R and S, with UDAs $a$ and $b$, respectively, the join $R \bowtie_{a = b, \tau} S$ consists of all pairs of tuples $r$, $s$ from R, S, respectively, such that $Pr(r.a = s.b) \ge \tau$.


In [76], two separate index structures are proposed in order to resolve the queries on categorical uncertain data. The first index is the probabilistic inverted index. In the probabilistic inverted index, for each value in the categorical domain, a list of the tuple-ids is stored which have a nonzero probability of taking on that particular value. Along with each tuple-id, this probability value is also stored. The inner lists containing the tuple-ids are often organized as a dynamic structure such as the B-Tree in order to facilitate insertions and deletions. As in any inverted index, the insertion and deletion are extremely straightforward. One only needs to determine the corresponding list(s), and insert or delete the corresponding tuple-id.

The inverted index can be used in conjunction with various pruning techniques in order to answer PETQs. The first step is to determine all the tuples in the different inverted lists which match the target parameters of the query. From these candidate tuples, only those which qualify more than the threshold are retained. A variety of other pruning techniques can be used in order to improve the efficiency of the different queries. The different techniques discussed in [76] include row pruning, column pruning, and approaches which examine the lists in a highest-probability-first fashion. The effectiveness of these different techniques for different kinds of queries is discussed in [76].

3.2.4 Probabilistic Distribution R-Tree

Next, we will discuss the probabilistic distribution R-Tree, which is an alternative for indexing UDAs. The broad approach is to index the vector of probability values of the possible attribute values. Thus, if there are $N$ possible values in the domain, then data points are created in $\mathbb{R}^N$. One distinction from traditional R-Trees is that the underlying queries have very different semantics. The uncertain queries are hyperplane queries on the N-dimensional cube. The MBRs of this R-Tree are thus defined in terms of the corresponding probability values. This ensures that the essential pruning properties of R-Trees are maintained. For example, for the case of the probabilistic threshold query, one can compute the maximum probability of equality for any node in the subtree by taking the maximum dot product of the target object probabilities with the corresponding probability vector from the MBR. When this value is less than the user-specified threshold, the corresponding subtree can be pruned. The two different index structures for categorical data have been tested in [76]. The results suggest that neither of the two techniques emerges as a clear winner, and either of the techniques may perform better depending upon the nature of the query and the underlying data.
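The pruning test can be illustrated with a hedged sketch (invented MBR bounds, not the structure from [76]): an upper bound on the equality probability of the query with anything stored under a node is the dot product of the query vector with the per-dimension maxima recorded in the node's MBR.

```python
def max_equality_probability(query_vector, mbr_upper):
    """Upper bound on Pr(q = u) for any UDA u stored under this node:
    dot product of the query probabilities with the MBR's per-value maxima."""
    return sum(q * m for q, m in zip(query_vector, mbr_upper))

def subtree_can_be_pruned(query_vector, mbr_upper, tau):
    """Prune the subtree if even the optimistic bound is below the threshold."""
    return max_equality_probability(query_vector, mbr_upper) < tau

# Hypothetical query distribution and node MBR (componentwise maxima of the
# probability vectors stored below the node).
q = [0.7, 0.2, 0.1]
mbr_upper = [0.3, 0.4, 0.5]

print(max_equality_probability(q, mbr_upper))     # 0.34
print(subtree_can_be_pruned(q, mbr_upper, 0.5))   # True: subtree skipped
```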

3.2.5 Other Work

Most of the above techniques make certain assumptions about the underlying probability distributions. An interesting technique discussed in [77] examines the problem for the case of arbitrary probability density functions. In this case, a general assumption is made about the probability distribution functions, in the sense that they are not all assumed to be even of the same type. For example, the uncertainty function for one object could be uniform, whereas the uncertainty function for another object could be Gaussian. This makes the problem much more difficult from the point of view of indexing, search, and pruning. In [77], an index structure called the U-Tree has been proposed, which can handle such kinds of queries. Other methods for indexing arbitrary probability distributions have been discussed in [19] and [57]. Finally, an interesting method called the Gauss-Tree [18] has been proposed for the case of probabilistic feature vectors. This tree has been shown to retain effectiveness for probabilistic retrieval. A detailed discussion is beyond the scope of this survey.

3.3 Join Processing on Uncertain Data

In the case of join processing, techniques have been developed for probabilistic join queries and similarity joins. In the case of probabilistic join queries, it is assumed that each item is associated with a range of possible values and a probability density function, which quantifies the behavior of the data over that range. The range of values associated with the uncertain variable $a$ is denoted by $a.U = [a.l, a.r]$. Thus, $a.l$ is the lower bound of the range and $a.r$ is the upper bound of the range. By incorporating the notion of uncertainty into data values, imprecise answers are generated. Each join-pair is associated with a probability to indicate the likelihood that the two tuples are matched. A second kind of join [53] is the similarity join. Similarity is measured by the distance between the two feature vectors. The join is performed based on this distance.

3.3.1 Probabilistic Join Queries

Since each tuple-pair is probabilistic in nature, the join may contain a number of false positives, which are typically those pairs that are associated with low probability values. Each tuple-pair is associated with a probability that indicates the likelihood of the join. In order to compute these probability values, the notions of equality and inequality need to be extended to support uncertain data.

We note that those join-pairs which have low probability can be discarded. This variant of probabilistic join queries is referred to as Probabilistic Threshold Join Queries. We note that the use of thresholds reduces the number of false positives, but it may also result in the introduction of false negatives. Thus, there is a tradeoff between the number of false positives and false negatives depending upon the threshold which is chosen. The reformulation of the join queries with thresholds is also helpful in improving the performance requirements of the method.

A number of pruning techniques are developed in order to improve the effectiveness of join processing. These pruning techniques are as follows: 1) Item-level pruning: In this case, two uncertain values are pruned without evaluating the probability. 2) Page-level pruning: In this case, two pages are pruned without probing into the data stored in each page. 3) Index-level pruning: In this case, the data which is stored in a subtree is pruned.

We note that a key operator in the case of joins is that of equality, since a join is performed only when the corresponding attribute values are equal. For the case of continuous data with infinitesimal resolution, this is never the case, since any pair of attributes can take on an infinite number of possible values. Therefore, a pair of attributes are defined to be equal to one another within an acceptable resolution c, if one attribute value is within c of the other. Let $a$ and $b$ be the two join attributes. Let $a.f(x)$ and $b.F(x)$ represent the corresponding probability density


and cumulative density functions, respectively. Correspondingly, the probability can be calculated as follows:

$$P(a =_c b) = \int_{-\infty}^{\infty} a.f(x) \cdot \big( b.F(x + c) - b.F(x - c) \big) \, dx. \quad (1)$$
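For distributions with a known CDF, (1) can be approximated numerically; the hedged sketch below assumes Gaussian-distributed join attributes with invented parameters (for such cases closed forms also exist, as noted next):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_equal_within_c(a, b, c, steps=4000):
    """Numerical approximation of Eq. (1):
    P(a =_c b) = integral of a.f(x) * (b.F(x + c) - b.F(x - c)) dx."""
    mu_a, sigma_a = a
    lo, hi = mu_a - 6 * sigma_a, mu_a + 6 * sigma_a   # effective support of a.f
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += normal_pdf(x, *a) * (normal_cdf(x + c, *b) - normal_cdf(x - c, *b)) * dx
    return total

a = (5.0, 1.0)   # hypothetical uncertain join attribute a ~ N(5, 1)
b = (5.5, 1.0)   # hypothetical uncertain join attribute b ~ N(5.5, 1)
print(prob_equal_within_c(a, b, c=0.5))   # probability that |a - b| <= c
```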

For the case of the > and < operators, it is not necessary to use the resolution, and it is possible to compute the corresponding probabilities of inequality $P(a > b)$ and $P(a < b)$ in a straightforward way. In order to evaluate the join, common block-nested-loop and indexed-loop algorithms can be used. The advantage of these algorithms is that they have been implemented in most database systems, and therefore only a small amount of modification is required in order to support the joins. The main difference is to use the uncertainty information in order to compute the probability of equality. For the use of probability density functions such as the uniform or the Gaussian function, closed-form formulas may be obtained in order to determine the probability of equality. Subsequently, those pairs with probability less than the required threshold can be pruned.

We note that the computations of the probability of a join can sometimes be expensive when the probabilistic computations cannot be expressed in closed form. Therefore, it is often useful to be able to develop quick pruning conditions in order to exclude certain tuple pairs from the join. Suppose $a$ and $b$ are uncertain-valued variables and $a.U \cap b.U \neq \emptyset$. Let $l_{a,b,c}$ be $\max\{a.l - c, b.l - c\}$, and let $u_{a,b,c}$ be $\min\{a.r + c, b.r + c\}$. For equality and inequality, the following pruning conditions hold true:

- $P(a =_c b)$ is at most $\min\{a.F(u_{a,b,c}) - a.F(l_{a,b,c}),\; b.F(u_{a,b,c}) - b.F(l_{a,b,c})\}$.

- Correspondingly, it is easy to see that $P(a \neq_c b)$ is at least equal to the complement of the above expression.

We note that the above expressions can be computed easily as long as the cumulative density function of the expression is available either in closed or numerical form. We note that a tuple pair can be eliminated when the probability of equality is less than the user-defined threshold.

We further note that in some cases, it may not be necessary to report the explicit probabilities of the tuple joins, as long as all tuples whose join probability is above the user-defined threshold are reported. For such cases, it is only necessary to determine whether the required probability lies above a given threshold, and we can use another pruning condition.

For a pair of uncertain-valued variables $a$ and $b$, it is possible to compute a bound on the corresponding probability that one is greater than the other. Specifically, the bounds are as follows:

- If $a.l \le b.r < a.r$, then $P(a > b) \ge 1 - a.F(b.r)$.

- If $a.l \le b.l \le a.r$, then $P(a > b) \le 1 - a.F(b.l)$.

The detailed proof of these results is described in [27]. The above two inequalities can be used for those join tuples which satisfy the preconditions described above. Depending upon the direction of the inequality, one can immediately include or exclude the corresponding join tuples from the join result.

We note that in many of these join processing algorithms, the unit of retrieval is a page from an index structure. In such cases, one can prune an entire node of the index tree by constructing bounds on the join behavior of the nodes in the tree. By using this approach, either page-level pruning can be achieved, or index-level pruning can be achieved by using an inner level node in the index tree. A concept called the x-bound is proposed in [26], and is used to augment the nodes of the underlying index structure. For more details, we refer the reader to [26]. Another recent method for spatial joins is discussed in [58].

    3.3.2 Similarity Join

The most popular similarity join is the distance-range join. In the distance-range join, we perform the join between two records if the distance between the two does not exceed a user-defined parameter ε. The natural generalization for the case of uncertain data is to compute the expected distance between two records, and perform the join if this expected distance is less than the parameter ε. This may result in considerable inaccuracies in the join computation process. This is because the expected distances are often skewed by the tail-end behavior of the probability functions of different attributes. Thus, the expected distances may not reflect the true likelihood that a given pair of records may join on a particular attribute. The result is that different joins which have a similar probability of lying within the range ε may be treated inconsistently. Therefore, it has been proposed in [53] to assign a probability value to each object pair. This probability value reflects the likelihood that the object pair belongs to the join result set. Only those join pairs which have a nonzero probability of belonging to the join result set are returned. In order to define this probability, one needs to quantify whether the distance between a pair of joining attributes lies within a certain range. To do so, the method in [53] computes the probability that the distance between the pair does not exceed ε. We note that in the deterministic case, when the distances are known, this distance function is the Dirac delta function. Thus, the deterministic case is a special case of the uncertain similarity join algorithm.
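The probabilistic distance-range join test can be illustrated with a small Monte Carlo sketch, which samples instantiations of the two uncertain records rather than using the analytical machinery of [53]; the Gaussian location uncertainty, the sampling functions, and the parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def join_probability(sample_r, sample_s, eps, n=20_000):
    """Monte Carlo estimate of P(dist(R, S) <= eps) for two uncertain records:
    draw instantiations of each record and count how often the pair lies
    within the distance threshold."""
    r = sample_r(n)                               # n x d instantiations of R
    s = sample_s(n)                               # n x d instantiations of S
    return float(np.mean(np.linalg.norm(r - s, axis=1) <= eps))

# Hypothetical uncertain locations with Gaussian uncertainty around reported means.
sample_r = lambda n: rng.normal([0.0, 0.0], 0.5, size=(n, 2))
sample_s = lambda n: rng.normal([1.0, 0.2], 0.5, size=(n, 2))

p = join_probability(sample_r, sample_s, eps=1.0)
print(p)          # report the pair (with its probability) only if p > 0
```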

    3.4 Data Integration with Uncertainty

An important application in the context of uncertain data is that of data integration. A first approach to this problem has been discussed in [37]. In order to do so, the work in [37] introduces the concept of probabilistic schema mappings. These are defined as a set of possible (ordinary) mappings between a source schema and a target schema, where each possible mapping has an associated probability. It is suggested that there are two possible interpretations of probabilistic schema mappings. The first (table-specific mapping) assumes that there is a single correct mapping, but we do not know which it is. This single correct mapping applies to all tuples. In the second interpretation (tuple-specific mapping), the mapping depends upon the tuple to which it is applied.

A number of algorithms are described in [37] for answering queries in the presence of probabilistic schema mappings. It has been shown that in the case of table-specific mappings, the data complexity is PTIME, and in the case of tuple-specific mappings, the complexity is #P-complete. Therefore, the second case is much more difficult. Nevertheless, it has been shown in [37] that for


large classes of real-world queries, it is possible to obtain all the answers in PTIME. More details on the specific algorithms may be found in [37].
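As a rough illustration of the table-specific (by-table) interpretation, the sketch below sums, for each answer tuple, the probabilities of the possible mappings that produce it. The schema, the mapping probabilities, and the helper names are hypothetical, and the query is a simple projection rather than the general query classes analyzed in [37].

```python
from collections import defaultdict

def answer_by_table(source_tuples, possible_mappings, query_attrs):
    """By-table semantics sketch: each possible mapping (with probability Pr(m))
    is assumed to apply to the whole source table; an answer's probability is
    the total probability of the mappings under which it is produced."""
    answer_prob = defaultdict(float)
    for mapping, prob in possible_mappings:       # mapping: source attr -> target attr
        answers = set()
        for t in source_tuples:
            renamed = {tgt: t[src] for src, tgt in mapping.items() if src in t}
            if query_attrs.issubset(renamed):
                answers.add(tuple(renamed[a] for a in sorted(query_attrs)))
        for ans in answers:
            answer_prob[ans] += prob
    return dict(answer_prob)

# Hypothetical example: is the source column 'contact' or 'fax' the target 'phone'?
source = [{"contact": "555-1234", "fax": "555-9999"}]
mappings = [({"contact": "phone"}, 0.7), ({"fax": "phone"}, 0.3)]
print(answer_by_table(source, mappings, {"phone"}))
# -> {('555-1234',): 0.7, ('555-9999',): 0.3}
```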

    3.5 Probabilistic Skylines on Uncertain Data

A problem which is quite relevant to the case of uncertain data is that of probabilistic skyline computation. The work in [65] provides a first approach to this problem. The problem of skyline computation is used in multicriteria decision-making applications. For example, consider the case when statistics of different NBA players are computed, such as the number of assists, rebounds, baskets, etc. It is unlikely that a single player will achieve the best performance in all respects. Therefore, the concepts of dominance and skyline are defined [65] as follows:

Definition 3.6. For two d-dimensional points u = (u_1, ..., u_d) and v = (v_1, ..., v_d), u is said to dominate v, if for each i ∈ {1, ..., d}, we have u_i ≤ v_i, and for some i_0 ∈ {1, ..., d}, we have u_{i_0} < v_{i_0}.

The above definition assumes that smaller values are more preferable, though it is easy enough to create a definition in which larger values may be preferable for one or more of the dimensions. The concept of dominance can be used in order to formally define the concept of a skyline.

Definition 3.7. Given a set of points S, a point u is a skyline point if there exists no other point v ∈ S such that v dominates u. The skyline of S is the set of all skyline points.
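The deterministic notions in Definitions 3.6 and 3.7 can be illustrated with a few lines of Python; this is a naive quadratic computation for exposition, not an efficient skyline algorithm.

```python
def dominates(u, v):
    """u dominates v (Definition 3.6): no worse in every dimension and strictly
    better in at least one, with smaller values preferred."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    """Definition 3.7: the points not dominated by any other point."""
    return [u for i, u in enumerate(points)
            if not any(dominates(v, u) for j, v in enumerate(points) if j != i)]

# Toy example with two criteria (both to be minimized).
print(skyline([(1, 4), (2, 2), (3, 1), (3, 3)]))   # (3, 3) is dominated by (2, 2)
```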

Clearly, all players that lie on the skyline may be considered outstanding players. Most skyline analyses only use certain data in the form of the mean performance of the different players. In practice, the performance of a player on different criteria may vary substantially from game to game. For example, it is known that most players are far more effective when playing on their home court. Therefore, it is possible to improve the quality of the analysis by using uncertainty information.

The key challenge in skyline computation is to capture the dominance relationship between uncertain objects. Therefore, the concept of probabilistic skyline was proposed in [65]. In this case, the probability of an object being in the skyline is the probability that the object is not dominated by any other objects.

Definition 3.8. Given a probability threshold p (0 ≤ p ≤ 1), the p-skyline is the set of uncertain objects, such that each of them has probability of at least p to be in the skyline.

Constructing a probabilistic skyline is much more complicated, because in many applications, the probability density function of uncertain data objects is not available explicitly. Only a set of instances is collected in order to approximate the probability density function. For example, in the case of the NBA example, the instances correspond to the game-by-game performance of a particular player, whereas the uncertain object corresponds to the distribution of a particular player's performance. One possible solution is to apply the skyline approach on the entire collected set of instances. However, this can be inefficient in practice, when the set of collected instances is very large compared to the underlying objects on which the skylines are computed.

In [65], two algorithms are proposed. The first is a bottom-up algorithm which computes the skyline probabilities of some selected instances of objects, and uses those instances to prune other instances and uncertain objects effectively. The second is a top-down algorithm which recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Both the top-down and bottom-up algorithms use the bounding-pruning-refining iteration. In the case of the bottom-up algorithm, the steps are as follows:

. Bounding. For an instance of an uncertain data object, we compute an upper bound and a lower bound on its skyline probability. These bounds can then be converted into bounds on the skyline probability of the uncertain object.

. Pruning. If the lower bound on the skyline probability of an uncertain object U is larger than the threshold p, then U lies in the p-skyline. If the upper bound is less than p, then U is not in the p-skyline.

. Refining. For all objects which cannot be conclusively determined to be either excluded from or included in the skyline, we need to obtain tighter bounds for the next iteration of bounding, pruning, and refining.

An important observation here is that in this method, we compute and refine the bounds of instances of uncertain objects by selectively computing the skyline probabilities on a small subset of instances. This technique is called bottom-up, since the bound computation and refinement start from instances at the bottom, and go up to skyline probabilities of objects. We refer the reader to [65] for the details of how the bounding and refinement are performed.
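For intuition, the sketch below shows the brute-force, instance-based skyline probability that the bounding-pruning-refining framework is designed to avoid computing exhaustively. The per-player instance sets are hypothetical, and each instance of an object is assumed to be equally likely.

```python
def dominates(u, v):
    # u dominates v: no worse in every dimension, strictly better in at least one
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline_probability(obj_instances, other_objects):
    """Instance-level skyline probability, averaged over the object's instances:
    an instance survives if, independently for every other object, the instance
    drawn from that object does not dominate it."""
    probs = []
    for u in obj_instances:
        p = 1.0
        for instances in other_objects:            # instances of one other object
            dominated = sum(1 for v in instances if dominates(v, u))
            p *= 1.0 - dominated / len(instances)
        probs.append(p)
    return sum(probs) / len(obj_instances)

# Hypothetical game-by-game instances for three players (two criteria, smaller = better).
player_a = [(1, 3), (2, 2)]
player_b = [(2, 4), (3, 3)]
player_c = [(4, 1), (5, 2)]
p_a = skyline_probability(player_a, [player_b, player_c])
print(p_a)                                         # player_a is in the p-skyline if p_a >= p
```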

    4 MINING APPLICATIONS FOR UNCERTAIN DATA

Recently, a number of mining applications have been devised for the case of uncertain data. Such applications include clustering and classification. We note that the presence of uncertainty can affect the results of data mining applications significantly. For example, in the case of a classification application, an attribute which has lower uncertainty is more useful than an attribute which has a higher level of uncertainty. Similarly, in a clustering application, the attributes which have a higher level of uncertainty need to be treated differently from those which have a lower level of uncertainty.

4.1 Clustering Uncertain Data

The presence of uncertainty changes the nature of the underlying clusters, since it affects the distance function computations between different data points. A technique has been proposed in [51] in order to find density-based clusters from uncertain data. The key idea in this approach is to compute uncertain distances effectively between objects which are probabilistically specified. The fuzzy distance is defined in terms of the distance distribution function. This distance distribution function encodes the probability that the distances between two uncertain objects lie within a certain user-defined range. Let d(X, Y) be the random variable representing the distance between X and Y. The distance distribution function is formally defined as follows:

Definition 4.1. Let X and Y be two uncertain records, and let p(X, Y) represent the distance density function between these


objects. Then, the probability that the distance lies within the range [a, b] is given by the following relationship:

P(a \le d(X, Y) \le b) = \int_{a}^{b} p(X, Y)(z) \, dz. \qquad (2)

Based on this technique and the distance density function, the method in [51] defines a reachability probability between two data points. This defines the probability that one data point is directly reachable from another with the use of a path, such that each point on it has density greater than a particular threshold. We note that this is a direct probabilistic extension of the deterministic reachability concept which is defined in the DBSCAN algorithm [38]. In the deterministic version of the algorithm [38], data points are grouped into clusters when they are reachable from one another by a path which is such that every point on this path has a minimum threshold data density. To this effect, the algorithm uses the condition that the ε-neighborhood of a data point should contain at least MinPts data points. The algorithm starts off at a given data point and checks if the ε-neighborhood contains MinPts data points. If this is the case, the algorithm repeats the process for each point in this cluster and keeps adding points until no more points can be added. One can plot the density profile of a data set by plotting the number of data points in the ε-neighborhood of various regions, and plotting a smoothed version of the curve. This is similar to the concept of probabilistic density estimation. Intuitively, this approach corresponds to the continuous contours of intersection between the density thresholds in Figs. 1 and 2 with the corresponding density profiles. The density threshold depends upon the value of MinPts. Note that the data points in any contiguous region will have density greater than the threshold. Note that the use of a higher density threshold (Fig. 2) results in three clusters, whereas the use of a lower density threshold (Fig. 1) results in two clusters. The fuzzy version of the DBSCAN algorithm (referred to as FDBSCAN) works in a similar way to the DBSCAN algorithm, except that the density at a given point is uncertain because of the underlying uncertainty of the data points. This corresponds to the fact that the number of data points within the ε-neighborhood of a given data point can be estimated only probabilistically, and is essentially an uncertain variable. Correspondingly, the reachability from one point to another is no longer deterministic, since other data points may lie within the ε-neighborhood of a given point with a certain probability, which may be less than 1. Therefore, the additional constraint that the computed reachability probability must be greater than 0.5 is added. Thus, this is a generalization of the deterministic version of the algorithm in which the reachability probability is always set to 1.

Another related technique discussed in [52] is that of hierarchical density-based clustering. An effective (deterministic) density-based hierarchical clustering algorithm is OPTICS [13]. We note that the core idea in OPTICS is quite similar to DBSCAN and is based on the concept of reachability distance between data points. While the method in DBSCAN defines a global density parameter which is used as a threshold in order to define reachability, the work in [52] points out that different regions in the data may have different data density, as a result of which it may not be possible to define the clusters effectively with a single density parameter. Rather, many different values of the density parameter define different (hierarchical) insights about the underlying clusters. The goal is to define an implicit output in terms of ordering the data points, so that when DBSCAN is applied with this ordering, one can obtain the hierarchical clustering at any level for different values of the density parameter. The key is to ensure that the clusters at different levels of the hierarchy are consistent with one another. One observation is that clusters defined over a lower value of ε are completely contained in clusters defined over a higher value of ε, if the value of MinPts is not varied. Therefore, the data points are ordered based on the value of ε required in order to obtain MinPts points in the ε-neighborhood. If the data points with smaller values of ε are processed first, then it is assured that higher density regions are always processed before lower density regions. This ensures that if the DBSCAN algorithm is used for different values of ε with this ordering, then a consistent result is obtained. Thus, the output of the OPTICS algorithm is not the cluster membership, but the order in which the data points are processed. We note that since the OPTICS algorithm shares so many characteristics with the DBSCAN algorithm, it is fairly easy to extend the OPTICS algorithm to the uncertain case using the same approach as that used for extending the DBSCAN algorithm. This is referred to as the FOPTICS algorithm. Note that one of the core concepts needed to order the data points is to determine the value of ε which is needed in order to obtain MinPts points in the corresponding neighborhood. In the uncertain case, this value is defined probabilistically, and the corresponding expected values are used to order the data points.
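The following sketch illustrates the FOPTICS-style ordering idea by estimating, via sampling, the expected value of ε needed to capture MinPts points around each uncertain object, and then sorting the objects by that value. The Gaussian uncertainty and the sampling procedure are illustrative assumptions rather than the method of [52].

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_core_eps(means, sigmas, idx, min_pts, trials=1000):
    """Estimate the expected eps needed so that the eps-neighborhood of uncertain
    point `idx` contains min_pts points: in each sampled instantiation this is the
    distance to the min_pts-th nearest point (the point itself is at distance 0)."""
    vals = []
    for _ in range(trials):
        world = rng.normal(means, sigmas)
        dist = np.sort(np.linalg.norm(world - world[idx], axis=1))
        vals.append(dist[min_pts - 1])
    return float(np.mean(vals))

means = np.array([[0.0, 0.0], [0.3, 0.1], [0.2, -0.2], [5.0, 5.0]])
sigmas = np.full_like(means, 0.2)
order = np.argsort([expected_core_eps(means, sigmas, i, min_pts=3)
                    for i in range(len(means))])
print(order)     # denser points (smaller expected eps) are processed earlier
```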


Fig. 1. Density-based profile with lower density threshold.
Fig. 2. Density-based profile with higher density threshold.


Finally, a technique in [63] uses an extension of the K-means algorithm in order to cluster the data. This technique is referred to as the UK-means algorithm. In UK-means, an object is assigned to the cluster whose representative has the smallest expected distance to the object. We note that expected distance computation is an expensive task. Therefore, the technique in [63] uses a number of pruning operations in order to reduce the computational load. The idea here is to use branch-and-bound techniques in order to minimize the number of expected distance computations between data points and cluster representatives. The broad idea is that once an upper bound on the minimum distance of a particular data point to some cluster representative has been quantified, it is not necessary to perform the computation between this point and another cluster representative, if it can be proved that the corresponding distance is greater than this bound. This approach is used to design an efficient algorithm for clustering uncertain location data.
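The sketch below captures the flavor of this assignment with pruning: a cheap lower bound on the expected distance is used to skip candidate representatives. The specific bound used here (distance from the object's sample mean minus the object's sample radius) is a simplification chosen for illustration and is not the exact pruning criterion of [63].

```python
import numpy as np

rng = np.random.default_rng(3)

def expected_distance(samples, center):
    """Expected distance from an uncertain object (represented by samples of its
    distribution) to a deterministic cluster representative."""
    return float(np.mean(np.linalg.norm(samples - center, axis=1)))

def assign_with_pruning(samples, centers):
    """Assign the object to the representative with the smallest expected distance,
    skipping candidates whose cheap lower bound already exceeds the best value."""
    mean = samples.mean(axis=0)
    radius = float(np.max(np.linalg.norm(samples - mean, axis=1)))
    best, best_idx = np.inf, -1
    for j, c in enumerate(centers):
        lower = np.linalg.norm(mean - c) - radius   # valid lower bound on E[dist]
        if lower >= best:
            continue                                # pruned without the expensive step
        d = expected_distance(samples, c)
        if d < best:
            best, best_idx = d, j
    return best_idx

obj_samples = rng.normal([1.0, 1.0], 0.1, size=(500, 2))   # one uncertain object
centers = np.array([[1.2, 0.9], [8.0, 8.0], [0.5, 4.0]])
print(assign_with_pruning(obj_samples, centers))            # -> 0
```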

The techniques in [51] and [63] were developed for the case of static data. Recently, the problem of clustering uncertain data has also been extended to the case of data streams [10]. In order to do so, the microclustering concept developed in [11] is extended to the uncertain case. In order to incorporate uncertainty into the clustering process, additional information about error statistics is incorporated into the microclusters. It has been shown in [10] that it is possible to efficiently cluster uncertain data streams with the use of such an approach. More recently, approximation algorithms [31] have been proposed for clustering uncertain data.

    4.2 Classification of Uncertain Data

A closely related problem is that of classification of uncertain data, in which the aim is to classify a test instance into one particular label from a set of class labels. In [17], a method was proposed for support vector machine classification of uncertain data. This technique is based on a discriminative modeling approach which relies on a total least squares method. This is because the total least squares method assumes a model in which we have additive noise. However, instead of using Gaussian noise, the technique in [17] uses a simple bounded uncertainty model. Such a model has a natural and intuitive geometric interpretation. Note that the support vector machine technique functions by constructing boundaries between groups of data records. Then, the margin created by the support vector machine can be modified by using the uncertainty of the points which lie on the boundary. For example, if points on one side of the boundary have greater uncertainty, this influences the way in which the margins are adjusted by the classifier. This is because the uncertainty in the data may result in some probability that the uncertain data point is on either side of the SVM boundary. The key idea in [17] is to provide a geometric algorithm which optimizes the probabilistic separation between the two classes on both sides of the boundary. Thus, the main difference from a standard SVM approach is to use the probability that a given data point lies on either side of the boundary while computing the degree of separation between the two classes.

4.3 Frequent Pattern Mining

The problem of frequent pattern mining has also been explored in the context of uncertain data. In this model, it is assumed that each item has an existential uncertainty in belonging to a transaction. This means that the probability of an item belonging to a particular transaction is modeled in this approach. In this case, an item set is defined to be frequent if its expected support is at least equal to a user-specified threshold.

In order to solve this version of the frequent pattern mining problem, the U-Apriori algorithm is proposed, which essentially mimics the Apriori algorithm, except that it performs the counting by computing the expected support of the different item sets. The expected support of a set of items in a transaction is obtained by simply multiplying the probabilities of the different items in the transaction. The approach can be made further scalable by using the concept of data trimming. In the data trimming approach, those items with very low existential probability are pruned from the data. The algorithm is then applied to the trimmed data. In [29], it has been shown that this approach can accurately mine the frequent patterns while maintaining efficiency. Further pruning tricks for improving the efficiency of frequent pattern mining algorithms may be found in [30]. Methods for finding frequent items in very large uncertain data sets or data streams may be found in [78].
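The expected-support computation at the heart of this approach is simple to state in code. The following sketch scores candidate 2-itemsets against hypothetical uncertain transactions (the item probabilities and the threshold are made up), leaving out the Apriori-style candidate generation and the data trimming optimization.

```python
from itertools import combinations

def expected_support(itemset, transactions):
    """Expected support under existential uncertainty: for each transaction,
    multiply the membership probabilities of the itemset's items, then sum
    these products over all transactions."""
    total = 0.0
    for t in transactions:                         # t: dict item -> existential probability
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)
        total += p
    return total

# Hypothetical uncertain transactions.
transactions = [
    {"milk": 0.9, "bread": 0.8, "eggs": 0.4},
    {"milk": 0.7, "bread": 0.5},
    {"bread": 0.95, "eggs": 0.6},
]
min_esup = 0.5
items = sorted({i for t in transactions for i in t})
frequent_pairs = [c for c in combinations(items, 2)
                  if expected_support(c, transactions) >= min_esup]
print(frequent_pairs)      # -> [('bread', 'eggs'), ('bread', 'milk')]
```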

    4.4 Outlier Detection with Uncertain Data

The problem of outlier detection has also been extended to the case of uncertain data. In the case of the outlier detection problem, differing levels of uncertainty across different dimensions may affect the determination of the outliers in the underlying data. For example, consider the case in Fig. 3, in which the contours of uncertainty for two data points X and Y are illustrated in the form of elliptical shapes. The data point X seems to be further away from the overall data distribution as compared to the data point Y. However, the contours of uncertainty are such that the data point X has a greater probability of being drawn from the overall data distribution. Correspondingly, it is possible to define the concept of an outlier in terms of the probability that a given data point is drawn from a dense region of the overall data distribution.

In order to quantify the probability that a given uncertain data point is drawn from a dense region, we define the concept of the η-probability. The η-probability of a data point X_i is defined as the probability that the uncertain data point lies in a region with (overall data) density at least η. Since the data point is uncertain, the η-probability may be computed by integrating the density of the data point along the contour of the intersection of the overall density function with the threshold η. However, this can be computationally challenging from a numerical point of view. Therefore, the η-probability may be estimated with the use of sampling. The idea is to draw multiple samples from the uncertainty distribution of the data point and compute the fraction of the samples over which the density threshold condition is satisfied. This can be used to define the concept of a (δ, η)-outlier.


    Fig. 3. Effect of uncertainty on outlier detection.


Definition 4.2. An uncertain data point X_i is defined to be a (δ, η)-outlier, if the η-probability of X_i in some subspace is less than δ.

In order to determine the (δ, η)-outliers, the algorithm of [7] explores subspaces in the data in a bottom-up fashion and determines all those data points for which the condition of Definition 4.2 is satisfied. A variety of techniques for speeding up the algorithm using microclustering are also discussed in [7]. It has been shown in [7] that the approach is much more effective than deterministic algorithms for outlier detection.

    4.5 General Approaches to Mining Uncertain Data

The techniques discussed in [17], [51], [52], and [63] are useful for working with a specific application such as clustering or classification. A different approach is to design an intermediate representation which can be used with a variety of data mining applications. A method of this nature has been proposed in [9]. In this case, a relaxed assumption is used that only the errors (in terms of standard deviation) of the records are known, rather than the entire probability density function. This is a more realistic assumption in many scenarios, since it may often be possible to measure the standard deviation of an uncertain record, whereas the probability density function may be obtained only by more extensive theoretical modeling. In any case, if the pdf is available, one can still apply the method by using the derived standard deviation of the density function. It is assumed that the mean value of the ith record is denoted by X_i and the standard deviation by ψ_i.

In [9], the broad idea is to design an intermediate representation of the data which can then be leveraged in order to effectively perform the mining process. This intermediate representation is in the form of an adjusted density estimate. We refer to the density estimate as adjusted, since the uncertainty is taken into account while creating the estimate. The density estimate f(x) based on N data points and kernel bandwidth h is defined as follows:

f(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,(h + \psi_i)} \, e^{-\frac{(x - X_i)^2}{2\,(h^2 + \psi_i^2)}}. \qquad (3)

The above result assumes a Gaussian kernel for each data point. This density estimate incorporates the error information, and can be utilized for a variety of data mining tasks as follows:

. Many density-based clustering algorithms [38] can be used in conjunction with this method. These clustering algorithms simply use a lower threshold on f(x) in order to isolate the dense (clustered) regions of the data. The technique can be used in order to isolate arbitrarily shaped clusters.

. A related approach [7] uses upper thresholds on the density estimate f(x) in order to isolate the sparse regions in the data. This can be used for outlier detection, by reporting those data points which lie in such sparse regions.

. In [9], it has been shown how to use this technique for classification. For a given test instance, one determines the class-specific density at that point, and reports the class with the highest density. The density estimate can also be computed over different subspaces in order to further improve the accuracy.

In general, since density estimation encodes the summary behavior of the data, it is expected that such an approach can be used for any data mining problem which uses the aggregate data behavior in different spatial localities. A brief sketch of such an adjusted density computation is shown below.
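The sketch follows the form of Equation (3) as reconstructed above (the exact normalization should be checked against [9]); the records, their error standard deviations, and the bandwidth are hypothetical.

```python
import numpy as np

def adjusted_density(x, means, psi, h):
    """Error-adjusted kernel density estimate in the spirit of Equation (3): each
    record contributes a Gaussian kernel whose spread combines the bandwidth h
    with that record's error (standard deviation) psi_i."""
    x = np.asarray(x, dtype=float)
    var = h ** 2 + psi ** 2                                  # per-record kernel variance
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * (h + psi))
    contrib = norm * np.exp(-(x[..., None] - means) ** 2 / (2.0 * var))
    return contrib.mean(axis=-1)

# Hypothetical 1D uncertain records: reported means and their error standard deviations.
means = np.array([1.0, 1.2, 0.8, 5.0])
psi = np.array([0.1, 0.3, 0.2, 1.0])
xs = np.linspace(-1.0, 7.0, 5)
print(adjusted_density(xs, means, psi, h=0.5))
# Dense regions (high f(x)) indicate clusters; sparse regions can flag outliers.
```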

    5 SUMMARY

The field of uncertain data management has seen a revival in recent years because of new ways of collecting data which have resulted in the need for uncertain representations. This paper surveys the broad areas of work in this rapidly expanding field. We presented the important data mining and management techniques in this field along with the key representational issues in uncertain data management. While the field will continue to expand over time, it is hoped that this survey will provide an understanding of the foundational issues and a good starting point to practitioners and researchers in focusing on the important and emerging issues in this field.

ACKNOWLEDGMENTS

The research of Charu C. Aggarwal was sponsored in part by the US Army Research Laboratory and the UK Ministry of Defense under Agreement W911NF-06-3-0001. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies of the US Government, the US Army Research Laboratory, the UK Ministry of Defense, or the UK Government. The US and UK governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice hereon.

REFERENCES

[1] S. Abiteboul, P. Kanellakis, and G. Grahne, "On the Representation and Querying of Sets of Possible Worlds," Proc. ACM SIGMOD, 1987.
[2] Managing and Mining Uncertain Data, C. Aggarwal, ed., Springer, 2009.
[3] P. Andritsos, A. Fuxman, and R.J. Miller, "Clean Answers over Dirty Databases: A Probabilistic Approach," Proc. 22nd IEEE Int'l Conf. Data Eng. (ICDE), 2006.
[4] L. Antova, C. Koch, and D. Olteanu, "From Complete to Incomplete Information and Back," Proc. ACM SIGMOD, 2007.
[5] L. Antova, C. Koch, and D. Olteanu, "10^{10^6} Worlds and Beyond: Efficient Representation and Processing of Incomplete Information," Proc. 23rd IEEE Int'l Conf. Data Eng. (ICDE), 2007.
[6] L. Antova, T. Jansen, C. Koch, and D. Olteanu, "Fast and Simple Relational Processing of Uncertain Data," Proc. 24th IEEE Int'l Conf. Data Eng. (ICDE), 2008.
[7] C.C. Aggarwal and P.S. Yu, "Outlier Detection with Uncertain Data," Proc. SIAM Int'l Conf. Data Mining (SDM), 2008.
[8] C.C. Aggarwal, "On Unifying Privacy and Uncertain Data Models," Proc. 24th IEEE Int'l Conf. Data Eng. (ICDE), 2008.
[9] C.C. Aggarwal, "On Density Based Transformations for Uncertain Data Mining," Proc. 23rd IEEE Int'l Conf. Data Eng. (ICDE), 2007.
[10] C.C. Aggarwal and P.S. Yu, "A Framework for Clustering Uncertain Data Streams," Proc. 24th IEEE Int'l Conf. Data Eng. (ICDE), 2008.
[11] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A Framework for Clustering Evolving Data Streams," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB), 2003.
[12] M. Arenas, L. Bertossi, and J. Chomicki, "Consistent Query Answers in Inconsistent Databases," Proc. 18th ACM Symp. Principles of Database Systems (PODS), 1999.
[13] M. Ankerst, M.M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," Proc. ACM SIGMOD, 1999.


[14] D. Barbara, H. Garcia-Molina, and D. Porter, "The Management of Probabilistic Data," IEEE Trans. Knowledge and Data Eng., vol. 4, no. 5, pp. 487-502, Oct. 1992.
[15] D. Bell, J. Guan, and S. Lee, "Generalized Union and Project Operations for Pooling Uncertain and Imprecise Information," Data and Knowledge Eng., vol. 18, no. 2, 1996.
[16] O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom, "ULDBs: Databases with Uncertainty and Lineage," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB), 2006.
[17] J. Bi and T. Zhang, "Support Vector Machines with Input Data Uncertainty," Proc. Advances in Neural Information Processing Systems (NIPS), 2004.
[18] C. Bohm, A. Pryakhin, and M. Schubert, "The Gauss-Tree: Efficient Object Identification of Probabilistic Feature Vectors," Proc. 22nd IEEE Int'l Conf. Data Eng. (ICDE), 2006.
[19] C. Bohm, P. Kunath, A. Pryakhin, and M. Schubert, "Querying Objects Modeled by Arbitrary Probability Distributions," Proc. 10th Int'l Symp. Spatial and Temporal Databases (SSTD), 2007.
[20] D. Burdick, P. Deshpande, T.S. Jayram, R. Ramakrishnan, and S. Vaithyanathan, "OLAP over Uncertain and Imprecise Data," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 970-981, 2005.
[21] D. Burdick, A. Doan, R. Ramakrishnan, and S. Vaithyanathan, "OLAP over Imprecise Data with Domain Constraints," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB), 2007.
[22] R. Cavello and M. Pittarelli, "The Theory of Probabilistic Databases," Proc. 13th Int'l Conf. Very Large Data Bases (VLDB), 1987.
[23] A.L.P. Chen, J.-S. Chiu, and F.S.-C. Tseng, "Evaluating Aggregate Operations over Imprecise Data," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 2, pp. 273-284, Apr. 1996.
[24] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. Vitter, "Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB), 2004.
[25] R. Cheng, D. Kalashnikov, and S. Prabhakar, "Evaluating Probabilistic Queries over Imprecise Data," Proc. ACM SIGMOD, 2003.
[26] R. Cheng, S. Singh, S. Prabhakar, R. Shah, J. Vitter, and Y. Xia, "Efficient Join Processing over Uncertain-Valued Attributes," Proc. 15th ACM Int'l Conf. Information and Knowledge Management (CIKM), 2006.
[27] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, J. Vitter, and Y. Xia, "Efficient Join Processing over Uncertain Data," Technical Report CSD TR# 05-004, Dept. of Computer Science, Purdue Univ., 2005.
[28] R. Cheng, D. Kalashnikov, and S. Prabhakar, "Querying Imprecise Data in Moving Object Environments," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 9, pp. 1112-1127, Sept. 2004.
[29] C.-K. Chui, B. Kao, and E. Hung, "Mining Frequent Itemsets from Uncertain Data," Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), 2007.
[30] C.-K. Chui and B. Kao, "A Decremental Approach for Mining Frequent Itemsets from Uncertain Data," Proc. 12th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), 2008.
[31] G. Cormode and A. McGregor, "Approximation Algorithms for Clustering Uncertain Data," Proc. 27th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2008.
[32] N. Dalvi and D. Suciu, "Efficient Query Evaluation on Probabilistic Databases," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB), 2004.

    [


Recommended