Experiences on Processing Spatial Data with MapReduce ssdbm09

1. Experiences on Processing Spatial Data with MapReduce
Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe
Florida International University, School of Computing and Information Sciences
11200 SW 8th St, Miami, FL 33199
{acary001,sunz,vagelis,rishen}@cis.fiu.edu
Sponsored by: NSF Cluster Exploratory (CluE)

2. Agenda
   1. Introduction
   2. Solving Spatial Problems in MapReduce
      - R-Tree Index Construction
      - Aerial Image Processing
   3. Experimental Results
   4. Related Work
   5. Conclusion

3. Introduction
- Spatial databases mainly store raster data (satellite/aerial digital images) and vector data (points, lines, polygons).
- Traditional sequential computing models may take excessive time to process large and complex spatial repositories.
- Emerging parallel computing models, such as MapReduce, offer the potential to scale data processing in spatial applications.

4. Introduction (cont.)
- MapReduce is an emerging massively parallel computing model (Google) composed of two functions:
  - Map: takes a key/value pair, executes some computation, and emits a set of intermediate key/value pairs as output.
  - Reduce: merges the intermediate values that share a key, executes some computation on them, and emits the final output.
- In this work, we present our experiences in applying the MapReduce model to:
  - Bulk-construct R-Trees (vector), and
  - Compute aerial image quality (raster).

5. Introduction (cont.)
[Diagram: access to the Hadoop cluster]
- Cloud: 480 computers (nodes), each with half a terabyte of storage, running Apache Hadoop and the Hadoop Distributed File System (HDFS) on the Linux operating system under the XEN hypervisor.
- Access: over the Internet through a SOCKS proxy server.
- Workflow: (1) bin/hadoop dfs -put, (2) bin/hadoop jar, (3) bin/hadoop dfs -get.

6. 2. Solving Spatial Problems in MapReduce
- R-Tree Index Construction
- Aerial Image Processing

7. MapReduce (MR) R-Tree Construction
R-Tree Bulk-Construction
- Every object o in database D has two attributes:
  - o.id: the object's unique identifier.
  - o.P: the object's location in some spatial domain.
- The goal is to build an R-Tree index on D.
MapReduce Algorithm
   1. Database partitioning function computation (MR).
   2. A small R-Tree is created for each partition (MR).
   3. The small R-Trees are merged into the final R-Tree.

8.
Phase 1 - Partitioning Function
Goal: compute f to assign objects of D into one of R possible partitions such that:
- R (ideally) equally-sized partitions are generated (minimal variance is acceptable).
- Objects close in the spatial domain are placed within the same partition.
Proposed solution:
- Use the Z-order space-filling curve to map spatial coordinate samples (3%) into a sorted sequence.
- Collect splitting points that partition the sequence into R ranges.
Map and Reduce inputs/outputs in computing partitioning function f.
Function  Input: (Key, Value)          Output: (Key, Value)
Map       (o.id, o.P)                  (C, U(o.P))
Reduce    (C, list(ui, i=1, .., L))    S
Where:
- o is a spatial object in the database.
- C is a constant that helps in sending the Mappers' outputs to a single Reducer.
- U maps a location to its space-filling curve value, e.g. the Z-order value.
- S is an array containing R-1 splitting points.

9. Phase 2 - R-Tree Construction in MR
- Mappers compute f() values for objects.
- Reducers compute an R-Tree for each group of objects with identical f() value.
MapReduce functions in constructing R-Trees.
Function  Input: (Key, Value)               Output: (Key, Value)
Map       (o.id, o.P)                       (f(o.P), o)
Reduce    (f(o.P), list(oi, i=1, .., A))    tree.root
Where:
- o is a spatial object in the database.
- f is the partitioning function computed in phase 1.
- tree.root is the R-Tree root node.

10. Phase 3 - R-Tree Consolidation
- A sequential process: the small R-Trees are merged into the final R-Tree.

11. Image Processing in MapReduce
Aerial Image Quality Computation
- Let d be an orthorectified aerial photography (DOQQ) file and t be a tile inside d; d.name is d's file name and t.q is the quality information of tile t.
- The goal is to compute a quality bitmap for d.
MapReduce Algorithm
- A customized InputFormatter partitions each DOQQ file d into several splits containing multiple tiles.
- The Mappers compute the quality bitmap for each tile inside a split.
- The Reducers merge all the bitmaps that belong to a file d and write them to an output file.

12. Image Processing in MapReduce
MapReduce Algorithm
Input and output of map and reduce functions
Function  Input: (Key, Value)              Output: (Key, Value)
Map       (d.name+t.id, t)                 (d.name, (t.id, t.q))
Reduce    (d.name, list(t.id, t.q))        Quality-bitmap of d
Where:
- d is a DOQQ file.
- t is a tile in d.
- t.q is the quality bitmap of t.

13. 3. Experimental Results

14. Experimental Results: Setting
Data Set
Table 4. Spatial data sets used in experiments*.
Problem        Data set    Objects    Data size (GB)  Description
R-Tree         FLD         11.4 M     5               Points of properties in the state of Florida.
R-Tree         YPD         37 M       5.3             Yellow pages directory of points of businesses, mostly in the United States but also in other countries.
Image Quality  Miami-Dade  482 files  52              Aerial imagery of Miami-Dade county, FL (3-inch resolution).
* Data sets supplied by the High Performance Database Research Center at Florida International University.
Environment
- The cluster was provided by the Google and IBM Academic Cluster Computing Initiative.
- The cluster contains around 480 computers running Hadoop, an open-source MapReduce implementation.

15. Experimental Results: R-Tree
R-Tree Construction Performance Metrics
[Figure: MapReduce job completion times for various numbers of reducers in phase 2 (MR2). Panels: (a) FLD data set, (b) YPD data set. X-axis: Reducers. Y-axis: Time (min). Series: MR1, MR2.]

16. Experimental Results: R-Tree
MapReduce R-Trees vs.
Single Process (SP)
          |     | Objects per Reducer   | Consolidated R-Tree
Data set  | R   | Average     | Stdev   | Nodes    | Height
FLD       | 2   | 5,690,419   | 12,183  | 172,776  | 4
FLD       | 4   | 2,845,210   | 6,347   | 172,624  | 4
FLD       | 8   | 1,422,605   | 2,235   | 173,141  | 4
FLD       | 16  | 711,379     | 2,533   | 162,518  | 4
FLD       | 32  | 355,651     | 2,379   | 173,273  | 3
FLD       | 64  | 177,826     | 1,816   | 173,445  | 3
FLD       | SP  | 11,382,185  | 0       | 172,681  | 4
YPD       | 4   | 9,257,188   | 22,137  | 568,854  | 4
YPD       | 8   | 4,628,594   | 9,413   | 568,716  | 4
YPD       | 16  | 2,314,297   | 7,634   | 568,232  | 4
YPD       | 32  | 1,157,149   | 6,043   | 567,550  | 4
YPD       | 64  | 578,574     | 2,982   | 566,199  | 4
YPD       | SP  | 37,034,126  | 0       | 587,353  | 5

17. Experimental Results: Imagery
Tile Quality Computation
[Figure: Fig. 9. MapReduce job completion time for tile quality computation. Panels: (a) Fixed data size, variable Reducers; (b) Variable data size, fixed Reducers. X-axes: Reducers, Size of data (GB). Y-axis: Time (min). Series: Map, Reduce.]

18. 4. Related Work

19. Related Work
- Previous works on parallel R-Tree construction faced intrinsic distributed computing problems: load balancing, process scheduling, fault tolerance, etc.
- Schnitzer and Leutenegger [16] proposed a Master-Client R-Tree, where the data set is first partitioned using the Hilbert packing sort algorithm, then the partitions are declustered across a number of processors, where individual trees are built. At the end, a master process combines the individual trees into the final R-Tree.
- Papadopoulos and Manolopoulos [17] proposed a methodology for sampling-based space partitioning, load balancing, and partition assignment to a set of processors for building R-Trees in parallel.

20. 5. Conclusion

21. Conclusion
- We used the MapReduce model to solve two spatial problems on a Google&IBM cluster:
  (a) Bulk-construction of R-Trees, and
  (b) Aerial image quality computation.
- MapReduce can dramatically improve task completion times. Our experiments show close to linear scalability.
- Our experience in this work shows MapReduce has the potential to be applicable to more complex spatial problems.

22. References
[1] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD 1984: 47-57.
[2] NSF Cluster Exploratory Program: http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm
[3] Google&IBM Academic Cluster Computing Initiative: http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html
[4] Apache Hadoop project: http://hadoop.apache.org
[6] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, USENIX Association, Volume 6, pp. 10-10, December 2004.
[12] High Performance Database Research Center (HPDRC), Research Division of the Florida International University, School of Computing and Information Sciences, University Park, Telephone: (305) 348-1706, FIU ECS-243, Miami, FL 33199.
[16] Schnitzer B., Leutenegger S.T.: Master-client R-trees: a new parallel R-tree architecture. In Proceedings of the 11th International Conference on Scientific and Statistical Database Management, pp. 68-77, August 1999.
[17] Apostolos Papadopoulos, Yannis Manolopoulos: Parallel bulk-loading of spatial data. Parallel Computing, Volume 29, Issue 10, pp. 1419-1444, October 2003.
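
Illustrative sketch (not from the slides): the Phase 1 partitioning idea from the R-Tree construction slides can be approximated in a few lines of Python. This is a single-process sketch under stated assumptions, not the authors' Hadoop code. It assumes 2-D points with non-negative integer coordinates, and the names z_order, splitting_points, and partition are hypothetical. The default sample rate of 0.03 mirrors the 3% sample mentioned in the slides.

```python
import bisect
import random


def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


def splitting_points(points, r, sample_rate=0.03, seed=0):
    """Sample the points, sort their Z-values, and pick R-1 splitting points."""
    rng = random.Random(seed)
    sample = [p for p in points if rng.random() < sample_rate]
    if not sample:                            # tiny inputs: fall back to all points
        sample = list(points)
    zs = sorted(z_order(x, y) for x, y in sample)
    # Splitting point i sits at the i/R quantile of the sorted sample.
    return [zs[(i * len(zs)) // r] for i in range(1, r)]


def partition(point, splits):
    """Partitioning function f: index (0..R-1) of the Z-range holding the point."""
    return bisect.bisect_right(splits, z_order(*point))
```

In Phase 2 terms, a Mapper would emit (partition(o.P, splits), o) for each object, so that each Reducer receives one Z-range of spatially close objects and builds a small R-Tree over it.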

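Illustrative sketch (not from the slides): the key/value flow of the aerial image quality job can be simulated in memory as follows. The slides do not define how tile quality is computed, so tile_quality below is a placeholder assumption (a tile counts as good if it has any non-zero pixel). All function names are hypothetical, and a (file name, tile id) tuple stands in for the concatenated key d.name+t.id.

```python
from collections import defaultdict


def tile_quality(pixels):
    """Placeholder quality score (an assumption): 1 if any pixel is non-zero."""
    return 1 if any(pixels) else 0


def map_tile(key, pixels):
    """Map: ((d.name, t.id), t) -> (d.name, (t.id, t.q))."""
    file_name, tile_id = key
    return file_name, (tile_id, tile_quality(pixels))


def reduce_file(file_name, tile_qualities):
    """Reduce: (d.name, list of (t.id, t.q)) -> quality bitmap ordered by t.id."""
    return file_name, [q for _, q in sorted(tile_qualities)]


def run_job(tiles):
    """Simulate the shuffle: group Mapper outputs by file name, then reduce."""
    groups = defaultdict(list)
    for key, pixels in tiles.items():
        name, value = map_tile(key, pixels)
        groups[name].append(value)
    return dict(reduce_file(name, values) for name, values in groups.items())
```

Keying the Mapper output by d.name alone is what lets a single Reducer call see every tile of one file and assemble that file's bitmap.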