ABSTRACT
This paper presents the first attempt at a full-fledged query optimizer for MapReduce-based spatial join algorithms. The optimizer develops its own taxonomy that covers almost all possible ways of performing a spatial join on any two input datasets. The optimizer comes in two flavors: cost-based and rule-based. Given two input datasets, the cost-based query optimizer evaluates the cost of every option in the developed taxonomy and selects the one with the lowest cost. The rule-based query optimizer abstracts the cost models of the cost-based optimizer into a set of simple, easy-to-check heuristic rules, and then applies those rules to select the lowest-cost option. Both query optimizers are deployed and experimentally evaluated inside a widely used open-source MapReduce-based big spatial data system. Exhaustive experiments show that both query optimizers are always successful in making the right decision when spatially joining any two datasets of up to 500 GB each.
On Spatial Joins in MapReduce