Abstract
Sedona (formerly GeoSpark) is an in-memory cluster computing system for processing large-scale spatial data. It extends the core of Apache Spark to support spatial datatypes, partitioning techniques, indexes, and operations (e.g., spatial range, k Nearest Neighbor (kNN) and spatial join queries). The k Nearest Neighbor Join Query (kNNJQ) finds, for each object in one dataset \(\mathbb {P}\), the k nearest neighbors of that object in another dataset \(\mathbb {Q}\). It is a common operation in numerous spatial applications (e.g., GISs, location-based systems, continuous monitoring). kNNJQ is a time-consuming spatial operation, since it can be considered a hybrid of spatial join and nearest neighbor search. Given that Sedona outperforms other Spark-based spatial analytics systems in most cases but does not support kNN joins, adding kNNJQ to it is a worthwhile challenge. Therefore, in this paper, we investigate how to design and implement an efficient kNNJQ algorithm in Sedona, using the most appropriate spatial partitioning technique and other improvements. Finally, we present the results of an extensive set of experiments with real-world datasets, which demonstrate that the proposed kNNJQ algorithm is efficient, scalable and robust in Sedona.
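To make the kNNJQ definition concrete, the following is a minimal single-machine sketch in Python. It is a brute-force illustration of the query semantics only, not the distributed algorithm proposed in the paper and not part of Sedona's API. The function name `knn_join`, the `Point` alias, and the choice of 2D points with Euclidean distance are assumptions made for this example.

```python
import heapq
import math
from typing import List, Sequence, Tuple

# Assumption for this sketch: objects are 2D points, compared by Euclidean distance.
Point = Tuple[float, float]


def knn_join(
    P: Sequence[Point], Q: Sequence[Point], k: int
) -> List[Tuple[Point, List[Point]]]:
    """Brute-force kNN join: pair each point of P with its k nearest points of Q.

    Every point of P is compared with every point of Q, so the cost grows with
    |P| * |Q|. This only illustrates what the query returns.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    result = []
    for p in P:
        # nsmallest returns at most k points of Q, ordered by increasing distance to p.
        neighbors = heapq.nsmallest(k, Q, key=lambda q: math.dist(p, q))
        result.append((p, neighbors))
    return result
```

For example, with P = [(0, 0), (10, 10)], Q = [(1, 0), (0, 2), (9, 9), (20, 20)] and k = 2, the point (0, 0) is paired with (1, 0) and (0, 2). The quadratic number of distance computations in this sketch is one reason kNNJQ is considered expensive on large datasets.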
Research of all authors is supported by the MINECO research project [TIN2017-83964-R].
Notes
- 1. Available at https://spark.apache.org/.
- 2. Available at http://sedona.apache.org/.
- 3.
- 4. Available at https://github.com/acgtic211/incubator-sedona/tree/KNNJ.
- 5. Available at http://spatialhadoop.cs.umn.edu/datasets.html.
- 6. Available at https://github.com/apache/incubator-sedona.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
García-García, F., Corral, A., Iribarne, L., Vassilakopoulos, M. (2021). Enhancing Sedona (formerly GeoSpark) with Efficient k Nearest Neighbor Join Processing. In: Attiogbé, C., Ben Yahia, S. (eds) Model and Data Engineering. MEDI 2021. Lecture Notes in Computer Science, vol 12732. Springer, Cham. https://doi.org/10.1007/978-3-030-78428-7_24
DOI: https://doi.org/10.1007/978-3-030-78428-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78427-0
Online ISBN: 978-3-030-78428-7
eBook Packages: Computer Science, Computer Science (R0)