
Efficient Processing of k Nearest Neighbor Joins using MapReduce

Wei Lu  Yanyan Shen  Su Chen  Beng Chin Ooi
National University of Singapore
{luwei1,shenyanyan,chensu,ooibc}@comp.nus.edu.sg

ABSTRACT

k nearest neighbor join (kNN join), designed to find the k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join on a centralized machine efficiently. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering and hence reduces both the shuffling and computational costs. To further reduce the shuffling cost, we propose two approximate algorithms that minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust, and scalable.

1. INTRODUCTION

k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it. kNN join typically serves as a primitive operation and is widely used in many data mining and analytic applications, such as k-means and k-medoids clustering and outlier detection [5, 12].

As a combination of the k nearest neighbor (kNN) query and the join operation, kNN join is an expensive operation. The naive implementation of the kNN join scans S once for each object in R, computing the distance between each pair of objects from R and S, which leads to a complexity of O(|R|·|S|). Therefore, considerable research efforts have been made to improve the efficiency of the kNN join [4, 17, 19, 18]. Most of the existing work is devoted to the design of elegant indexing techniques that avoid scanning the whole dataset repeatedly and prune as many distance computations as possible.

All the existing work [4, 17, 19, 18] is based on the centralized paradigm, in which the kNN join is performed on a single, centralized server. However, given the limited computational capability and storage of a single machine, such a system eventually suffers from performance deterioration as the size of the dataset increases, especially for multi-dimensional datasets: the cost of computing the distance between objects grows with the number of dimensions, and the curse of dimensionality leads to a decline in the pruning power of the indexes.

Given this limitation of a single machine, a natural solution is to exploit parallelism in a distributed computing environment. MapReduce [6] is a programming framework for processing large-scale datasets by exploiting the parallelism among a cluster of computing nodes. Soon after its birth, MapReduce gained popularity for its simplicity, flexibility, fault tolerance, and scalability; it is now well studied [10] and widely used in both commercial and scientific applications. MapReduce is therefore an attractive framework for processing kNN join operations over massive, multi-dimensional datasets.

However, existing kNN join techniques cannot easily be applied to, or extended to be incorporated into, MapReduce.
Most of the existing work relies on centralized indexing structures such as the B+-tree [19] and the R-tree [4], which cannot be directly accommodated in such a distributed and parallel environment.

In this paper, we investigate the problem of implementing the kNN join operator in MapReduce. The basic idea is similar to the hash join algorithm. Specifically, the mapper assigns a key to each object from R and S; the objects with the same key are distributed to the same reducer in the shuffling process; the reducer performs the kNN join over the objects that have been shuffled to it. To guarantee the correctness of the join result, one basic requirement on the data partitioning is that for each object r in R, the k nearest neighbors of r in S must be sent to the same reducer as r, i.e., they must be assigned the same key as r. As a result, objects in S may be replicated and distributed to multiple reducers. These replicas lead to a high shuffling cost and also increase the computational cost of the join within a reducer. Hence, a good mapping function that minimizes the number of replicas is one of the most critical factors for the performance of the kNN join in MapReduce.

In particular, we summarize the contributions of this paper as follows.

- We present an implementation of the kNN join operator using MapReduce, especially for large volumes of multi-dimensional data. The implementation defines the mapper and reducer jobs and requires no modification to the MapReduce framework.
- We design an efficient mapping method that divides objects into groups, each of which is processed by a reducer to perform the kNN join. First, the objects are divided into partitions based on a Voronoi diagram with carefully selected pivots. Then, data partitions (i.e., Voronoi cells) are clustered into a group only if the distances between them are restricted by a specific bound. We derive a distance bound that leads to groups of objects that are more closely involved in the kNN join.
- We derive a cost model for computing the number of replicas generated in the shuffling process. Based on the cost model, we propose two grouping strategies that greedily reduce the number of replicas.
- We conduct extensive experiments studying the effect of various parameters on two real datasets and several synthetic datasets. The results show that our proposed methods are efficient, robust, and scalable.

The remainder of the paper is organized as follows. Section 2 describes background knowledge. Section 3 gives an overview of processing the kNN join in the MapReduce framework, followed by the details in Section 4. Section 5 presents the cost model and the grouping strategies for reducing the shuffling cost. Section 6 reports the experimental results. Section 7 discusses related work, and Section 8 concludes the paper.

2. PRELIMINARIES

In this section, we first define the kNN join formally and then give a brief review of the MapReduce framework. Table 1 lists the symbols used throughout this paper.

2.1 kNN Join

We consider data objects in an n-dimensional metric space D. Given two data objects r and s, |r, s| denotes the distance between r and s in D. For ease of exposition, the Euclidean distance (L2) is used as the distance measure in this paper, i.e.,

    |r, s| = √( Σ_{1≤i≤n} (r[i] − s[i])² ),   (1)

where r[i] (resp. s[i]) denotes the value of r (resp. s) along the i-th dimension of D. Without loss of generality, our methods can easily be applied to other distance measures, such as the Manhattan distance (L1) and the maximum distance (L∞).

DEFINITION 1. (k nearest neighbors) Given an object r, a dataset S, and an integer k, the k nearest neighbors of r from S, denoted as KNN(r, S), is a set of k objects from S such that ∀o ∈ KNN(r, S), ∀s ∈ S − KNN(r, S), |o, r| ≤ |s, r|.

DEFINITION 2. (kNN join) Given two datasets R and S and an integer k, the kNN join of R and S (denoted as R ⋉_KNN S, abbreviated as R ⋉ S) combines each object r ∈ R with its k nearest neighbors from S. Formally,

    R ⋉ S = {(r, s) | r ∈ R, s ∈ KNN(r, S)}.   (2)

According to Definition 2, R ⋉ S is a subset of R × S. Note that the kNN join operation is asymmetric, i.e., R ⋉ S ≠ S ⋉ R. Given k ≤ |S|, the cardinality of R ⋉ S is k·|R|. In the rest of this paper we assume that k ≤ |S|; otherwise, the kNN join degrades to a cross join and simply produces the Cartesian product R × S.
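
For concreteness, the following brute-force sketch renders Definitions 1 and 2 in Python (our illustration, not the paper's code; the helper names dist, knn, and knn_join are ours). It is exactly the naive O(|R|·|S|) evaluation discussed in the introduction:

    import heapq
    import math

    def dist(r, s):
        # Euclidean distance, Equation (1)
        return math.sqrt(sum((ri - si) ** 2 for ri, si in zip(r, s)))

    def knn(r, S, k):
        # KNN(r, S): the k objects of S closest to r (Definition 1)
        return heapq.nsmallest(k, S, key=lambda s: dist(r, s))

    def knn_join(R, S, k):
        # R ⋉ S = {(r, s) | r in R, s in KNN(r, S)} (Definition 2)
        return [(r, s) for r in R for s in knn(r, S, k)]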

Table 1: Symbols and their meanings

    Symbol          Definition
    D               an n-dimensional metric space
    R (resp. S)     an object set R (resp. S) in D
    r (resp. s)     an object, r ∈ R (resp. s ∈ S)
    |r, s|          the distance from r to s
    k               the number of nearest neighbors
    KNN(r, S)       the k nearest neighbors of r from S
    R ⋉ S           the kNN join of R and S
    P               a set of pivots
    p_i             a pivot in P
    p_r             the pivot in P that is closest to r
    P_i^R           the partition of R that corresponds to p_i
    p_i.d_j         the j-th smallest distance from objects in P_i^S to p_i
    U(P_i^R)        max{|r, p_i| : r ∈ P_i^R}
    L(P_i^R)        min{|r, p_i| : r ∈ P_i^R}
    T_R             the summary table for the partitions of R
    N               the number of reducers

2.2 MapReduce Framework

MapReduce [6] is a popular programming framework for supporting data-intensive applications on shared-nothing clusters. In MapReduce, input data are represented as key-value pairs, and several functional programming primitives, including Map and Reduce, are provided to process them. The Map function takes an input key-value pair and produces a set of intermediate key-value pairs. The MapReduce runtime system then groups and sorts all the intermediate values associated with the same intermediate key and sends them to the Reduce function, which accepts an intermediate key and its corresponding values, applies its processing logic, and produces the final result, typically a list of values.

Hadoop is an open-source implementation of the MapReduce framework. Data in Hadoop are stored in HDFS by default. HDFS consists of multiple DataNodes for storing data and a master node called the NameNode for monitoring the DataNodes and maintaining all the metadata. In HDFS, imported data are split into equal-sized chunks, and the NameNode allocates the chunks to different DataNodes. The MapReduce runtime system establishes two kinds of processes, the JobTracker and the TaskTrackers. The JobTracker splits a submitted job into map and reduce tasks and schedules the tasks among all the available TaskTrackers; the TaskTrackers accept and process the assigned tasks. For a map task, the TaskTracker takes a data chunk specified by the JobTracker and applies the map() function to it. When all map() functions complete, the runtime system groups all the intermediate results and launches a number of reduce tasks to run the reduce() function and produce the final results. Both map() and reduce() functions are specified by the user.
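
To illustrate these primitives, here is a minimal single-process simulation of the model (our illustration, not Hadoop code; the word-count map_fn/reduce_fn pair is a standard example, not from the paper):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(key, value):
        # emit one intermediate (word, 1) pair per word of the line
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # aggregate all intermediate values sharing the same key
        yield (key, sum(values))

    def run_mapreduce(records, map_fn, reduce_fn):
        # mimic the runtime: map, group and sort by key, then reduce
        inter = [kv for k, v in records for kv in map_fn(k, v)]
        inter.sort(key=itemgetter(0))
        out = []
        for key, group in groupby(inter, key=itemgetter(0)):
            out.extend(reduce_fn(key, (v for _, v in group)))
        return out

    print(run_mapreduce([(0, 'a b a')], map_fn, reduce_fn))
    # [('a', 2), ('b', 1)]
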
2.3 Voronoi Diagram-based Partitioning

Given a dataset O, the main idea of Voronoi diagram-based partitioning is to select M objects (which need not belong to O) as pivots and then split the objects of O into M disjoint partitions, where each object is assigned to the partition of its closest pivot. (If multiple pivots are equally close to an object, the object is assigned to the partition with the smallest number of objects.) In this way, the whole data space is split into M generalized Voronoi cells. Figure 1 shows an example of splitting objects into 5 partitions by Voronoi diagram-based partitioning.
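
A minimal sketch of this partitioning rule (ours, not the paper's code; dist is a distance function such as the one in Equation 1):

    def assign_partitions(objects, pivots, dist):
        # split O into M generalized Voronoi cells: each object goes
        # to the partition of its closest pivot; ties are broken in
        # favor of the currently smallest partition (see above)
        partitions = [[] for _ in pivots]
        for o in objects:
            d = [dist(o, p) for p in pivots]
            dmin = min(d)
            ties = [i for i, di in enumerate(d) if di == dmin]
            i = min(ties, key=lambda t: len(partitions[t]))
            partitions[i].append(o)
        return partitions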

Figure 1: An example of data partitioning

For the sake of brevity, let P be the set of selected pivots. ∀p_i ∈ P, P_i^O denotes the set of objects from O that take p_i as their closest pivot. For an object o, let p_o and P_o^O be its closest pivot and the corresponding partition, respectively. In addition, we use U(P_i^O) and L(P_i^O) to denote the maximum and minimum distance from pivot p_i to the objects of P_i^O, i.e., U(P_i^O) = max{|o, p_i| : o ∈ P_i^O} and L(P_i^O) = min{|o, p_i| : o ∈ P_i^O}.

DEFINITION 3. (Range Selection) Given a dataset O, an object q, and a distance threshold θ, the range selection of q from O finds all objects O' ⊆ O such that ∀o ∈ O', |q, o| ≤ θ.

By splitting the dataset into a set of partitions, we can answer range selection queries based on the following theorem.

THEOREM 1. [8] Given two pivots p_i and p_j, let HP(p_i, p_j) be the generalized hyperplane, such that any object o lying on HP(p_i, p_j) is equidistant from p_i and p_j. Then ∀o ∈ P_j^O, the distance of o to HP(p_i, p_j), denoted as d(o, HP(p_i, p_j)), is:

    d(o, HP(p_i, p_j)) = (|o, p_i|² − |o, p_j|²) / (2·|p_i, p_j|)   (3)

Figure 2(a) illustrates the distance d(o, HP(p_i, p_j)). Given an object q, its home partition P_q^O, and another partition P_i^O, Theorem 1 allows us to compute the distance from q to HP(p_q, p_i). Hence, we can derive the following corollary.

COROLLARY 1. Given a partition P_i^O with P_i^O ≠ P_q^O, if d(q, HP(p_q, p_i)) > θ, then ∀o ∈ P_i^O, |q, o| > θ.

Given a partition P_i^O, if d(q, HP(p_q, p_i)) > θ, then by Corollary 1 we can discard all objects of P_i^O. Otherwise, we check a subset of the objects of P_i^O based on Theorem 2.

THEOREM 2. [9, 20] Given a partition P_i^O, ∀o ∈ P_i^O, a necessary condition for |q, o| ≤ θ is:

    max{L(P_i^O), |p_i, q| − θ} ≤ |p_i, o| ≤ min{U(P_i^O), |p_i, q| + θ}   (4)

Figure 2(b) shows an example of the bounding area of Equation 4. To answer a range selection, we only need to check the objects that lie in the bounding area of each partition.

Figure 2: Properties of data partitioning. (a) the distance d(o, HP(p_i, p_j)); (b) the bounding area of Equation 4.
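The two results combine into a simple range-selection procedure: Corollary 1 discards whole cells, and Theorem 2 restricts the scan within a surviving cell. A sketch (ours, not the paper's code; parts[i] is assumed to store each object of P_i^O with its precomputed pivot distance, and L[i], U[i] are the bounds of Theorem 2):

    def d_hp(o, p_i, p_j, dist):
        # Theorem 1, Equation (3): distance from o (whose closest
        # pivot is p_j) to the generalized hyperplane HP(p_i, p_j)
        return (dist(o, p_i) ** 2 - dist(o, p_j) ** 2) / (2 * dist(p_i, p_j))

    def range_select(q, theta, q_pid, pivots, parts, L, U, dist):
        # find all o with |q, o| <= theta (Definition 3)
        hits = []
        for i, part in enumerate(parts):
            # Corollary 1: skip the whole cell if the hyperplane
            # separating it from q's cell is farther than theta
            if i != q_pid and d_hp(q, pivots[i], pivots[q_pid], dist) > theta:
                continue
            lo = max(L[i], dist(pivots[i], q) - theta)   # Equation (4)
            hi = min(U[i], dist(pivots[i], q) + theta)
            for o, d_opi in part:                        # d_opi = |p_i, o|
                if lo <= d_opi <= hi and dist(q, o) <= theta:
                    hits.append(o)
        return hits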

3. AN OVERVIEW OF KNN JOIN USING MAPREDUCE

In MapReduce, the mappers produce key-value pairs from the input data; each reducer performs a specific task on the group of pairs that share the same key. In essence, the mappers do something similar to (and typically more than) a hashing function, so a naive and straightforward way of performing the kNN join in MapReduce resembles the hash join algorithm. Specifically, the map() function assigns each object r ∈ R a key; based on the key, R is split into disjoint subsets, i.e., R = ∪_{1≤i≤N} R_i with R_i ∩ R_j = ∅ for i ≠ j, and each subset R_i is distributed to a reducer. Without any pruning rule, the entire set S has to be sent to each reducer to be joined with R_i; finally, R ⋉ S = ∪_{1≤i≤N} (R_i ⋉ S).

In this scenario, two major considerations affect the performance of the entire join process:

1. the shuffling cost of sending intermediate results from the mappers to the reducers;
2. the cost of performing the kNN join on the reducers.

Obviously, this basic strategy is too expensive: each reducer performs a kNN join between a subset of R and the entire S, and for a large S this may exceed the capability of a single reducer.

An alternative framework [21], called H-BRJ, splits both R and S into N disjoint subsets, i.e., R = ∪_{1≤i≤N} R_i and S = ∪_{1≤j≤N} S_j. As in the basic strategy, the partitioning of R and S is performed by the map() function; a reducer performs the kNN join between a pair of subsets R_i and S_j; finally, the join results of all pairs of subsets are merged, giving R ⋉ S = ∪_{1≤i,j≤N} (R_i ⋉ S_j). In H-BRJ, R and S are partitioned into equal-sized subsets at random.

While the basic strategy produces the join result in a single MapReduce job, H-BRJ requires two. Since S is partitioned into several subsets, the join result of the first reducer is incomplete, and another MapReduce job is required to combine the results of R_i ⋉ S_j for all 1 ≤ j ≤ N. Hence, the shuffling cost of H-BRJ is N·(|R| + |S|) + Σ_i Σ_j |R_i ⋉ S_j| (the first term is the shuffling cost of the first MapReduce job, and the second is the cost of the second job, which merges the partial results), while the shuffling cost of the basic strategy is |R| + N·|S|.
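For intuition, plug hypothetical numbers (ours, not the paper's) into these formulas: with |R| = |S| = 10^7, N = 100, and k = 10, the basic strategy shuffles |R| + N·|S| ≈ 10^9 objects, while H-BRJ shuffles N·(|R| + |S|) = 2×10^9 objects in its first job plus Σ_i Σ_j |R_i ⋉ S_j| = N·k·|R| = 10^10 result pairs in its second. The replica-based strategy described next shuffles only |R| + α·|S|; for a modest replication factor such as α = 3, that is 4×10^7, two orders of magnitude less.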

To reduce the shuffling cost, a better strategy is to partition R into N disjoint subsets and, for each subset R_i, to find a subset S_i of S such that R_i ⋉ S = R_i ⋉ S_i, so that R ⋉ S = ∪_{1≤i≤N} (R_i ⋉ S_i). Then, instead of sending the entire S to each reducer (as in the basic strategy) or sending each R_i to N reducers (as in H-BRJ), S_i is sent to the reducer that R_i belongs to, and the kNN join is performed between R_i and S_i only.

This approach avoids both replicating R and sending the entire S to all reducers. However, to guarantee the correctness of the kNN join, the subset S_i must contain the k nearest neighbors of every r ∈ R_i, i.e., ∀r ∈ R_i, KNN(r, S) ⊆ S_i. Note that S_i ∩ S_j may be nonempty, since an object s may be one of the k nearest neighbors of both r_i ∈ R_i and r_j ∈ R_j. Hence, some objects of S must be replicated and distributed to multiple reducers. The shuffling cost is then |R| + α·|S|, where α is the average number of replicas of an object of S. Clearly, reducing α reduces both the shuffling and the computational costs.

In summary, to minimize the join cost, we need to:

1. find a good partitioning of R;
2. find the minimal set S_i for each R_i ⊆ R, given the partitioning of R. (The minimum such set is S_i = ∪_{r∈R_i} KNN(r, S); however, it is impossible to find the k nearest neighbors of every r in advance.)

Intuitively, a good partitioning of R should cluster the objects of R by proximity, so that the objects of a subset R_i are likely to share common k nearest neighbors from S. For each R_i, the objects of the corresponding S_i are then cohesive, leading to a smaller S_i. Such a partitioning therefore lowers not only the shuffling cost but also the computational cost of performing the kNN join between each R_i and S_i, i.e., the number of distance calculations.

4. HANDLING KNN JOIN USING MAPREDUCE

In this section, we introduce our implementation of the kNN join using MapReduce. Figure 3 illustrates the workflow, which consists of one preprocessing step and two MapReduce jobs.

Figure 3: An overview of kNN join in MapReduce (preprocessing and pivot selection; a first MapReduce job that partitions R and S and builds the partition statistics and summary tables T_R and T_S; a second MapReduce job that outputs KNN(r, S) for every r ∈ R)

- First, the preprocessing step finds a set of pivot objects based on the input dataset R. The pivots are used to create a Voronoi diagram, which helps partition the objects of R effectively while preserving their proximity.
- The first MapReduce job consists of a single Map phase, which takes the selected pivots and the datasets R and S as input. It finds the nearest pivot for each object of R ∪ S and computes the distance between the object and that pivot. The result of this phase is a partitioning of R based on the Voronoi diagram of the pivots. Meanwhile, the mappers also collect some statistics about each partition R_i.
- Given the partitioning of R, the mappers of the second MapReduce job find the subset S_i of S for each subset R_i, based on the statistics collected by the first job. Finally, each reducer performs the kNN join between a pair of R_i and S_i received from the mappers.

4.1 Data Preprocessing

As mentioned in the previous section, a good partitioning of R for optimizing the kNN join should cluster objects by proximity. We adopt the Voronoi diagram-based data partitioning technique reviewed in Section 2, which is well known for preserving data proximity, especially in multi-dimensional spaces. Therefore, before launching the MapReduce jobs, a preprocessing step is invoked on a master node to select a set of pivots for the Voronoi diagram-based partitioning. In particular, any of the following three strategies can be used to select the pivots.

- Random Selection. First, T random sets of objects are selected from R. Then, for each set, we compute the total sum of the distances between every two objects. Finally, the objects of the set with the maximum total sum are selected as the pivots.
- Farthest Selection. The pivots are selected iteratively from a sample of R (since the preprocessing step runs on a single master node, the original dataset may be too large for it to process); a sketch follows this list. First, we randomly select an object as the first pivot. Next, the object with the largest distance to the first pivot is selected as the second pivot. In the i-th iteration, the object that maximizes the sum of its distances to the first i−1 pivots is chosen as the i-th pivot.
- k-means Selection. Like the farthest selection, k-means selection first samples R. Then, the traditional k-means clustering method is applied to the sample. With the k data clusters generated, the center of each cluster is chosen as a pivot for the Voronoi diagram-based data partitioning.
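A sketch of the farthest selection strategy (ours, not the paper's code; sample is a random sample of R and dist is the distance measure):

    import random

    def farthest_pivots(sample, M, dist):
        # iteratively pick the object that maximizes the sum of its
        # distances to the pivots chosen so far
        pivots = [random.choice(sample)]
        while len(pivots) < M:
            nxt = max(sample, key=lambda o: sum(dist(o, p) for p in pivots))
            pivots.append(nxt)
        return pivots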

4.2 First MapReduce Job

Given the set of pivots selected in the preprocessing step, we launch a MapReduce job that partitions the data and collects statistics for each partition. Figure 4 shows an example of the input and output of the mapper function of this job.

Specifically, before the map function is launched, the selected pivots P are loaded into main memory in each mapper. A mapper sequentially reads each object o from its input split, computes the distance between o and every pivot in P, and assigns o to the closest pivot. Finally, as illustrated in Figure 4, the mapper outputs each object o along with its partition id, its original dataset name (R or S), and its distance to the closest pivot. Meanwhile, the map function also collects some statistics for each input split; these statistics are merged when the MapReduce job completes.

Figure 4: Partitioning and building the summary tables
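
A Hadoop Streaming style sketch of this mapper (ours, not the paper's code; the input line format and the tab-separated output layout are assumptions):

    import sys

    def closest_pivot(o, pivots, dist):
        # return (partition id, distance to the closest pivot)
        ds = [dist(o, p) for p in pivots]
        pid = min(range(len(pivots)), key=ds.__getitem__)
        return pid, ds[pid]

    def run_mapper(lines, pivots, dist):
        # each input line: "<R|S> <object id> <coord 1> ... <coord n>"
        stats = {}            # (pid, tag) -> [count, min dist, max dist]
        for line in lines:
            tag, oid, *coords = line.split()
            pid, d = closest_pivot([float(c) for c in coords], pivots, dist)
            print(pid, tag, oid, d, sep='\t')   # the partitioned record
            st = stats.setdefault((pid, tag), [0, float('inf'), 0.0])
            st[0] += 1
            st[1] = min(st[1], d)
            st[2] = max(st[2], d)
        return stats          # merged across mappers into T_R and T_S

    # usage: run_mapper(sys.stdin, pivots, dist)

The k smallest pivot distances per partition of S (the p_i.d_j entries of T_S) can be collected in the same pass; they are omitted here for brevity.
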
Two in-memory tables called summary tables are created to keep these statistics. Figure 3 shows an example of the summary tables T_R and T_S for the partitions of R and S, respectively. Specifically, T_R maintains the following information for every partition of R: the partition id, the number of objects in the partition, and the minimum distance L(P_i^R) and maximum distance U(P_i^R) from an object in partition P_i^R to the pivot. Note that although the pivots are selected based on dataset R alone, the Voronoi diagram based on these pivots can be used to partition S as well. T_S maintains the same fields as T_R for S. Moreover, T_S also maintains the distances between p_i and the objects of KNN(p_i, P_i^S), where KNN(p_i, P_i^S) denotes the k nearest neighbors of pivot p_i among the objects of partition P_i^S. In Figure 3, p_i.d_j in T_S represents the distance between pivot p_i and its j-th nearest neighbor in KNN(p_i, P_i^S). The information in T_R and T_S is used to guide the generation of S_i for each R_i and to speed up the computation of R_i ⋉ S_i by deriving kNN distance bounds for the objects of R in the second MapReduce job.

4.3 Second MapReduce Job

The second MapReduce job performs the kNN join in the way introduced in Section 3. The main task of the mapper is to find the corresponding subset S_i for each R_i; each reducer then performs the kNN join between a pair of R_i and S_i. As mentioned previously, to guarantee correctness, S_i must contain the kNN of every r ∈ R_i, i.e., S_i = ∪_{r∈R_i} KNN(r, S). However, we cannot obtain the exact S_i without performing the kNN join of R_i and S. Therefore, in the following, we derive a distance bound based on the partitioning of R that helps reduce the size of S_i.

4.3.1 Distance Bound of kNN

Instead of computing the kNN from S for each individual object of R, we derive a bound on the kNN distance using a set-oriented approach: given a partition P_i^R (i.e., R_i) of R, we bound the kNN distance of all objects of P_i^R at a time, based on T_R and T_S, which we obtain as a byproduct of the first MapReduce job.

THEOREM 3. Given a partition P_i^R ⊆ R and an object s of P_j^S ⊆ S, the upper bound distance from s to ∀r ∈ P_i^R, denoted as ub(s, P_i^R), is:

    ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|   (5)

PROOF. ∀r ∈ P_i^R, by the triangle inequality, |r, p_j| ≤ |r, p_i| + |p_i, p_j|. Similarly, |r, s| ≤ |r, p_j| + |p_j, s|. Hence, |r, s| ≤ |r, p_i| + |p_i, p_j| + |p_j, s|. Since r ∈ P_i^R, by the definition of U(P_i^R) we have |r, p_i| ≤ U(P_i^R). Therefore |r, s| ≤ U(P_i^R) + |p_i, p_j| + |p_j, s| = ub(s, P_i^R).

Figure 5(a) shows the geometric meaning of ub(s, P_i^R). According to Equation 5, we can take the k objects of S with the smallest upper bound distances as kNN candidates for all objects of P_i^R. For ease of exposition, let KNN(P_i^R, S) be the k objects of S with the smallest ub(s, P_i^R). We can then derive a bound, denoted as θ_i and corresponding to P_i^R, of the kNN distance for all objects of P_i^R as follows:

    θ_i = max_{s ∈ KNN(P_i^R, S)} ub(s, P_i^R).   (6)

Clearly, ∀r ∈ P_i^R, the distance from r to any object of KNN(r, S) is at most θ_i. Hence, we can bound the kNN distance of all objects of P_i^R at a time. Moreover, according to Equation 5, within each partition P_i^S only the k objects with the smallest distances to p_i may contribute to refining KNN(P_i^R, S); the remainder cannot. Hence, we maintain only the k smallest distances from the objects of each partition of S to its pivot in the summary table T_S (shown in Figure 3).

Algorithm 1: boundingKNN(P_i^R)
1  create a priority queue PQ of size k;
2  foreach P_j^S do
3    foreach s ∈ KNN(p_j, P_j^S) do   /* kept in T_S */
4      ub(s, P_i^R) ← U(P_i^R) + |p_i, p_j| + |s, p_j|;
5      if PQ.size < k then PQ.add(ub(s, P_i^R));
6      else if ub(s, P_i^R) < PQ.top then
7        PQ.remove(); PQ.add(ub(s, P_i^R));
8      else break;
9  return PQ.top;

Algorithm 1 shows the details of how θ_i is computed. We first create a priority queue PQ of size k, in which PQ.top is the largest upper bound currently kept (line 1). For each partition P_j^S, we compute ub(s, P_i^R) for every s ∈ KNN(p_j, P_j^S), whose distance |s, p_j| is maintained in T_S. To speed up the computation of θ_i, the distances |s, p_j| are maintained in T_S in ascending order. Hence, once ub(s, P_i^R) ≥ PQ.top, we can guarantee that no remaining object of KNN(p_j, P_j^S) can refine θ_i (line 8). Finally, we return the top of PQ, which is taken as θ_i (line 9).
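A runnable rendering of Algorithm 1 (ours; Python's heapq is a min-heap, so the max-heap PQ is simulated by negating keys; knn_of_pivot[j] is the ascending list of |s, p_j| values kept in T_S, U[i] = U(P_i^R), and pivot_dist[i][j] = |p_i, p_j|):

    import heapq

    def bounding_knn(i, k, U, pivot_dist, knn_of_pivot):
        # compute theta_i for partition P_i^R, Equations (5) and (6)
        pq = []                                   # max-heap via negation
        for j, dists in enumerate(knn_of_pivot):
            for d_spj in dists:                   # ascending, as in T_S
                ub = U[i] + pivot_dist[i][j] + d_spj      # Equation (5)
                if len(pq) < k:
                    heapq.heappush(pq, -ub)
                elif ub < -pq[0]:
                    heapq.heapreplace(pq, -ub)    # evict the largest ub
                else:
                    break     # later s in this cell only give larger ub
        return -pq[0]         # theta_i, Equation (6)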

4.3.2 Finding S_i for R_i

Similarly to Theorem 3, we can derive the lower bound distance from an object s ∈ P_j^S to any object of P_i^R.

THEOREM 4. Given a partition P_i^R and an object s of P_j^S, the lower bound distance from s to ∀r ∈ P_i^R, denoted by lb(s, P_i^R), is:

    lb(s, P_i^R) = max{0, |p_i, p_j| − U(P_i^R) − |s, p_j|}   (7)

PROOF. ∀r ∈ P_i^R, by the triangle inequality, |r, p_j| ≥ |p_j, p_i| − |p_i, r|. Similarly, |r, s| ≥ |r, p_j| − |p_j, s|. Hence, |r, s| ≥ |p_j, p_i| − |p_i, r| − |p_j, s|. Since r ∈ P_i^R, by the definition of U(P_i^R) we have |r, p_i| ≤ U(P_i^R). Thus |r, s| ≥ |p_i, p_j| − U(P_i^R) − |s, p_j|. As the distance between any two objects is nonnegative, lb(s, P_i^R) is set to max{0, |p_i, p_j| − U(P_i^R) − |s, p_j|}.

Figure 5(b) shows the geometric meaning of lb(s, P_i^R).

Figure 5: Bounding k nearest neighbors. (a) the upper bound ub(s, P_i^R); (b) the lower bound lb(s, P_i^R).

Clearly, ∀s ∈ S, if lb(s, P_i^R) > θ_i, then s cannot be one of KNN(r, S) for any r ∈ P_i^R, so s can safely be pruned. Hence, it is easy to verify whether an object s ∈ S needs to be assigned to S_i.

THEOREM 5. Given a partition P_i^R and an object s ∈ S, a necessary condition for s to be assigned to S_i is: lb(s, P_i^R) ≤ θ_i.

According to Theorem 5, ∀s ∈ S, by computing lb(s, P_i^R) for every P_i^R ⊆ R, we can derive all the S_i that s must be assigned to. However, when the number of partitions of R is large, this computation becomes expensive, since for every s ∈ P_j^S we need to evaluate lb(s, P_i^R), and hence |p_i, p_j|, against every partition. To cope with this problem, we propose Corollary 2, which finds all the S_i that s is assigned to based only on |s, p_j|.

COROLLARY 2. Given a partition P_i^R and a partition P_j^S, ∀s ∈ P_j^S, a necessary condition for s to be assigned to S_i is:

    |s, p_j| ≥ LB(P_j^S, P_i^R),   (8)

where LB(P_j^S, P_i^R) = |p_i, p_j| − U(P_i^R) − θ_i.

PROOF. The conclusion follows directly from Theorem 5 and Equation 7.

According to Corollary 2, for partition P_j^S, exactly the objects whose pivot distances lie in the interval [LB(P_j^S, P_i^R), U(P_j^S)] are assigned to S_i. Algorithm 2, which is self-explanatory, shows how LB(P_j^S, P_i^R) is computed.

Algorithm 2: compLBOfReplica()
1  foreach P_i^R do
2    θ_i ← boundingKNN(P_i^R);
3  foreach P_j^S do
4    foreach P_i^R do
5      LB(P_j^S, P_i^R) ← |p_i, p_j| − U(P_i^R) − θ_i;

4.3.3 kNN Join between R_i and S_i

As a summary, Algorithm 3 describes the details of the kNN join procedure carried out by the second MapReduce job. Before the map function is launched, we first compute LB(P_j^S, P_i^R) for every P_j^S (lines 1-2).

Algorithm 3: kNN join
1  map-setup   /* before running the map function */
2    compLBOfReplica();
3  map(k1, v1)
4    if k1.dataset = R then
5      pid ← getPartitionID(k1.partition);
6      output(pid, (k1, v1));
7    else
8      P_j^S ← k1.partition;
9      foreach P_i^R do
10       if LB(P_j^S, P_i^R) ≤ k1.dist then
11         output(i, (k1, v1));
12 reduce(k2, v2)   /* at the reduce phase */
13   parse P_i^R and S_i = (P_j1^S, ..., P_jM^S) from (k2, v2);
14   sort P_j1^S, ..., P_jM^S in ascending order of |p_i, p_jl|;
15   compute θ_i ← max_{s ∈ KNN(P_i^R, S)} ub(s, P_i^R);
16   foreach r ∈ P_i^R do
17     θ ← θ_i; KNN(r, S) ← ∅;
18     for j ← j1 to jM do
19       if P_j^S can be pruned by Corollary 1 then
20         continue;
21       foreach s ∈ P_j^S do
22         if s is not pruned by Theorem 2 then
23           refine KNN(r, S) with s;
24           θ ← max_{o ∈ KNN(r,S)} |o, r|;
25     output(r, KNN(r, S));

For each object r ∈ R, the map function generates a key-value pair whose key is the partition id and whose value consists of k1 and v1 (lines 4-6).

For each object s ∈ S, the map function creates a set of new key-value pairs, one for each partition P_i^R for which s is not pruned by Corollary 2 (lines 7-11). In this way, the objects of each partition of R and their potential k nearest neighbors are sent to the same reducer. By parsing the key-value pair (k2, v2), the reducer recovers the partition P_i^R and the subset S_i, which consists of P_j1^S, ..., P_jM^S (line 13), and computes the kNN of the objects of partition P_i^R (lines 16-25).

∀r ∈ P_i^R, in order to reduce the number of distance computations, we first sort the partitions of S_i by the distances of their pivots to pivot p_i in ascending order (line 14). This exploits the fact that if a pivot is near p_i, its partition has a higher probability of containing objects close to r; in this way we derive a tighter kNN distance bound for every object of P_i^R, leading to higher pruning power. Based on Equation 6, we derive the bound θ_i of the kNN distance for all objects of P_i^R (line 15). Hence, we can issue a range search with query r and threshold θ over the dataset S_i. First, KNN(r, S) is set to empty (line 17). Then, the partitions P_j^S are checked one by one (lines 18-24). For each partition P_j^S, by Corollary 1, if d(r, HP(p_i, p_j)) > θ, no object of P_j^S can refine KNN(r, S), and we proceed to the next partition directly (lines 19-20). Otherwise, for every s ∈ P_j^S that is not pruned by Theorem 2, we compute the distance |r, s|. If |r, s| < θ, then KNN(r, S) is refined with s and θ is updated to max_{o ∈ KNN(r,S)} |o, r| (lines 21-24).
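The map side of Algorithm 3 can be sketched as follows (ours; LB is the table filled by compLBOfReplica, and each record carries the partition id and pivot distance emitted by the first job):

    def second_job_mapper(records, LB, num_R_partitions):
        # records: (dataset, partition id, object, dist to pivot)
        for dataset, pid, obj, d in records:
            if dataset == 'R':
                yield (pid, (dataset, obj, d))     # lines 4-6
            else:
                for i in range(num_R_partitions):  # lines 7-11
                    # Corollary 2: replicate s to reducer i only if
                    # |s, p_j| >= LB(P_j^S, P_i^R)
                    if d >= LB[pid][i]:
                        yield (i, (dataset, obj, d))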

