Enhancing Sedona (formerly GeoSpark) with Efficient k Nearest Neighbor Join Processing

García-García, Francisco; Corral, Antonio; Iribarne, Luis; Vassilakopoulos, Michael

doi:10.1007/978-3-030-78428-7_24

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12732)

Included in the following conference series:

International Conference on Model and Data Engineering

Abstract

Sedona (formerly GeoSpark) is an in-memory cluster computing system for processing large-scale spatial data, which extends the core of Apache Spark to support spatial datatypes, partitioning techniques, indexes, and operations (e.g., spatial range, k Nearest Neighbor (kNN) and spatial join queries). The k Nearest Neighbor Join Query (kNNJQ) finds, for each object in one dataset \(\mathbb{P}\), the k nearest neighbors of that object in another dataset \(\mathbb{Q}\). It is a common operation used in numerous spatial applications (e.g., GISs, location-based systems, continuous monitoring, etc.). kNNJQ is a time-consuming spatial operation, since it can be considered a hybrid of spatial join and nearest neighbor search. Given that Sedona outperforms other Spark-based spatial analytics systems in most cases but does not support kNN joins, adding kNNJQ to it is a worthwhile challenge. Therefore, in this paper, we investigate how to design and implement an efficient kNNJQ algorithm in Sedona, using the most appropriate spatial partitioning technique and other improvements. Finally, the results of an extensive set of experiments with real-world datasets are presented, demonstrating that the proposed kNNJQ algorithm is efficient, scalable and robust in Sedona.
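
To make the kNNJQ definition concrete, the following is a minimal single-machine sketch in Python. It is a brute-force baseline that compares every object of one dataset with every object of the other, not the partition-based algorithm the paper proposes for Sedona. The function name `knn_join`, the use of 2D points with Euclidean distance, and the sample data are assumptions made for illustration only.

```python
import heapq
import math
from typing import List, Tuple

Point = Tuple[float, float]


def knn_join(P: List[Point], Q: List[Point], k: int) -> List[Tuple[Point, List[Point]]]:
    """Brute-force kNN join (hypothetical helper, for illustration only).

    For each point p in P, return the k points of Q closest to p
    under Euclidean distance, ordered from nearest to farthest.
    """
    result = []
    for p in P:
        # Scan all of Q for every p and keep the k smallest distances.
        neighbors = heapq.nsmallest(k, Q, key=lambda q: math.dist(p, q))
        result.append((p, neighbors))
    return result


# Illustrative data (not taken from the paper's experiments).
P = [(0.0, 0.0), (10.0, 10.0)]
Q = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (20.0, 20.0)]
pairs = knn_join(P, Q, k=2)
```

Because every point of one dataset is checked against every point of the other, the work grows with the product of the two dataset sizes, which illustrates why the abstract calls kNNJQ time-consuming and why spatial partitioning matters in a distributed setting.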

Research of all authors is supported by the MINECO research project [TIN2017-83964-R].

Notes

1. Available at https://spark.apache.org/.
2. Available at http://sedona.apache.org/.
3. See http://sedona.apache.org/download/features/.
4. Available at https://github.com/acgtic211/incubator-sedona/tree/KNNJ.
5. Available at http://spatialhadoop.cs.umn.edu/datasets.html.
6. Available at https://github.com/apache/incubator-sedona.

Author information

Authors and Affiliations

Department of Informatics, University of Almeria, Almeria, Spain
Francisco García-García, Antonio Corral & Luis Iribarne
Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
Michael Vassilakopoulos

Corresponding author

Correspondence to Antonio Corral.

Editor information

Editors and Affiliations

University of Nantes, Nantes, France
Christian Attiogbé
Tallinn University of Technology, Tallinn, Estonia
Sadok Ben Yahia

About this paper

Cite this paper

García-García, F., Corral, A., Iribarne, L., Vassilakopoulos, M. (2021). Enhancing Sedona (formerly GeoSpark) with Efficient k Nearest Neighbor Join Processing. In: Attiogbé, C., Ben Yahia, S. (eds) Model and Data Engineering. MEDI 2021. Lecture Notes in Computer Science, vol 12732. Springer, Cham. https://doi.org/10.1007/978-3-030-78428-7_24

DOI: https://doi.org/10.1007/978-3-030-78428-7_24
Published: 14 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78427-0
Online ISBN: 978-3-030-78428-7
eBook Packages: Computer Science, Computer Science (R0)
