Skip to main content
Log in

Cost estimation of spatial join in spatialhadoop

  • Published:
GeoInformatica Aims and scope Submit manuscript

Abstract

Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been done in the past decades in order to provide efficient algorithms for spatial join and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de-facto standard for processing large amount of data (big-data) and some attempts have been made for extending existing frameworks for the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others, but the choice might depend on the characteristics of the involved datasets. The aim of this work is to deeply analyse them and define a cost model for ranking them based on the characteristics of the dataset at hand (i.e., selectivity or spatial properties). This cost model has been extensively tested w.r.t. a set of synthetic datasets in order to prove its effectiveness.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

Notes

  1. From the official Hadoop documentation, the maximum number of parallel reducers could be set equal to the number of available containers multiplied by a factor of 0.95.

References

  1. An N, Yang Z, Sivasubramaniam A (2001) Selectivity estimation for spatial joins. In: Proceedings of the 17th International Conference on Data Engineering, pp 368–375. https://doi.org/10.1109/ICDE.2001.914849

  2. Aref W, Samet H (1994) A cost model for query optimization using R-Trees. In: Proceedings of the Second ACM Workshop on Advances in Geographic Information Systems, pp 60–67

  3. Belussi A, Carra D, Migliorini S, Negri M, Pelagatti G (2018) What makes spatial data big? A discussion on how to partition spatial data. In: Proceedings of 10th International Conference on Geographic Information Science, pp 1–15. https://doi.org/10.1109/ICDE.2015.7113382

  4. Belussi A, Migliorini S, Eldawy A (2018) A Cost Model for Spatial Join Operations in SpatialHadoop. Tech. Rep. RR108/2018, Dept. of Computer Science, University of Verona. https://iris.univr.it/handle/11562/981957

  5. Belussi A, Migliorini S, Eldawy A (2018) Detecting Skewness of Big Spatial Data in SpatialHadoop. In: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 432–435. https://doi.org/10.1145/3274895.3274923

  6. van den Bercken J, Seeger B, Widmayer P (1999) The bulk index join: a generic approach to processing non-equijoins. In: Proceedings of the 15th International Conference on Data Engineering, pp 257–. https://doi.org/10.1109/ICDE.1999.754937

  7. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A Comparison of Join Algorithms for Log Processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp 975–986. https://doi.org/10.1145/1807167.1807273

  8. Chasparis H, Eldawy A (2017) Experimental Evaluation of Selectivity Estimation on Big Spatial Data. In: Proceedings of the Fourth International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (GeoRich 2017), collocated with ACM SIGMOD 2017, pp 8:1–8:6. Chicago. https://doi.org/10.1145/3080546.3080553

  9. Dittrich J, Seeger B (2000) Data redundancy and duplicate detection in spatial join processing. In: Lomet DB, Weikum G (eds) Proceedings of the 16th International Conference on Data Engineering. https://doi.org/10.1109/ICDE.2000.839452. IEEE Computer Society, San Diego, pp 535–546

  10. Eldawy A, Alarabi L, Mokbel MF (2015) Spatial Partitioning Techniques in SpatialHadoop. Proc VLDB Endow 8(12):1602–1605. https://doi.org/10.14778/2824032.2824057

    Article  Google Scholar 

  11. Eldawy A, Mokbel MF (2015) SpatialHadoop: A MapReduce framework for spatial data. In: Proceedings of the 31st IEEE International Conference on Data Engineering, pp 1352–1363. https://doi.org/10.1109/ICDE.2015.7113382

  12. Eldawy A, Mokbel MF (2017) Spatial Join with Hadoop. In: Encyclopedia of GIS. Springer, pp 2032–2036. https://doi.org/10.1007/978-3-319-17885-1_1570

  13. Eldawy A, Mokbel MF (2017) The Era of Big Spatial Data. Proc VLDB Endow 10(12):1992–1995. https://doi.org/10.14778/3137765.3137828

    Article  Google Scholar 

  14. Eldawy A, Sabek I, Elganainy M, Bakeer A, Abdelmotaleb A, Mokbel MF (2017) Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data. In: Proceedings of the 15th International Symposium on Advances in Spatial and Temporal Databases, pp 65–83. https://doi.org/10.1007/978-3-319-64367-0_4

  15. Gu J, Peng S, Wang XS, Rao W, Yang M, Cao Y (2014) Cost-Based Join Algorithm Selection in Hadoop. In: Proceedings of the 15th International Conference on Web Information Systems Engineering, pp 246–261. https://doi.org/10.1007/978-3-319-11746-1_18

  16. Han S, Choi W, Muwafiq R, Nah Y (2017) Impact of Memory Size on Bigdata Processing based on Hadoop and Spark. In: PRoceedings of the International Conference on Research in Adaptive and Convergent Systems (RACS), pp 275–280. https://doi.org/10.1145/3129676.3129688

  17. Harada L, Nakano M, Kitsuregawa M, Takagi M (1990) Query Processing for Multi-Attribute Clustered Records. In: Proceedings of 16th International Conferece on Very Large Data Bases, pp 59–70

  18. Hoel EG, Samet H (1995) Benchmarking Spatial Join Operations with Spatial Output. In: Proceedings of the 21th International Conference on Very Large Data Bases, pp 606–618

  19. Jacox EH, Samet H (2007) Spatial Join Techniques. ACM Transactions on Database Systems 32(1). https://doi.org/10.1145/1206049.1206056

  20. Lin X, Meng Z, Xu C, Wang M (2012) A Practical Performance Model for Hadoop MapReduce. In: Proceeding of the 2012 IEEE international conference on cluster computing workshops, pp 231–239. https://doi.org/10.1109/ClusterW.2012.24

  21. Mamoulis N, Papadias D (2001) Multiway Spatial Joins. ACM Trans Database Syst 26(4), 424–475. https://doi.org/10.1145/503099.503101

  22. Mavridis I, Karatza H (2017) Performance Evaluation of Cloud-based Log File Analysis with Apache Hadoop and Apache Spark. J Syst Softw 125 (C):133–151. https://doi.org/10.1016/j.jss.2016.11.037

    Article  Google Scholar 

  23. Nievergelt J, Hinterberger H, Sevcik KC (1981) The grid file: An adaptable, symmetric multi-key file structure. In: Proceedings of 3rd Conference of the European Cooperation in Informatics – Trends in Information Processing Systems, pp 236–251. https://doi.org/10.1007/3-540-10885-8_45

  24. Papadopoulos A, Rigaux P, Scholl M (1999) A Performance Evaluation of Spatial Join Processing Strategies. In: Proceedings of 6th International Symposium on Advances in Spatial Databases, pp 286–307. https://doi.org/10.1007/3-540-48482-5_18

  25. Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. SIGMOD Rec 25(2):259–270. https://doi.org/10.1145/235968.233338

    Article  Google Scholar 

  26. Rigaux P, Scholl M, Voisard A (2002) Spatial Databases with Application to GIS. Morgan Kaufmann Publishers Inc., San Francisco

  27. Sabek I, Mokbel MF (2017) On Spatial Joins in MapReduce. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 21:1–21:10. https://doi.org/10.1145/3139958.3139967

  28. Samadi Y, Zbakh M, Tadonki C (2018) Performance comparison between Hadoop and Spark frameworks using HiBenchbenchmarks Concurrency and Computation: Practice and Experience 30(12). https://doi.org/10.1002/cpe.4367

  29. Siddique AB, Eldawy A, Hristidis V (2019) Comparing Synopsis Techniques for Approximate Spatial Data Analysis. PVLDB 12(11):1583–1596. https://doi.org/10.14778/3342263.3342635, http://www.vldb.org/pvldb/vol12/p1583-siddique.pdf

  30. Sowell B, Salles MV, Cao T, Demers A, Gehrke J (2013) An Experimental Analysis of Iterated Spatial Joins in Main Memory. Proc VLDB Endow 6 (14):1882–1893. https://doi.org/10.14778/2556549.2556570

    Article  Google Scholar 

  31. Šidlauskas D, Jensen CS (2014) Spatial Joins in Main Memory: Implementation Matters!. Proc VLDB Endow 8(1):97–100. https://doi.org/10.14778/2735461.2735470

    Article  Google Scholar 

  32. White T (2015) HAdoop: The Definitive Guide, 4th. O’Reilly Media, Inc

  33. Xie D, Li F, Yao B, Li G, Chen Z, Zhou L, Guo M (2016) Simba: spatial in-memory big data analysis. In: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 86:1–86:4. https://doi.org/10.1145/2996913.2996935

  34. Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 70:1–70:4. https://doi.org/10.1145/2820783.2820860

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to S. Migliorini.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Belussi, A., Migliorini, S. & Eldawy, A. Cost estimation of spatial join in spatialhadoop. Geoinformatica 24, 1021–1059 (2020). https://doi.org/10.1007/s10707-020-00414-x

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10707-020-00414-x

Keywords

Navigation