
Enhancing Sedona (formerly GeoSpark) with Efficient k Nearest Neighbor Join Processing

  • Conference paper
Model and Data Engineering (MEDI 2021)

Abstract

Sedona (formerly GeoSpark) is an in-memory cluster computing system for processing large-scale spatial data. It extends the core of Apache Spark to support spatial datatypes, partitioning techniques, indexes, and operations (e.g., spatial range, k Nearest Neighbor (kNN) and spatial join queries). The k Nearest Neighbor Join Query (kNNJQ) finds, for each object in one dataset \(\mathbb{P}\), the k nearest neighbors of that object in another dataset \(\mathbb{Q}\). It is a common operation in numerous spatial applications (e.g., GISs, location-based systems and continuous monitoring). kNNJQ is a time-consuming spatial operation, since it can be considered a hybrid of spatial join and nearest neighbor search. Given that Sedona outperforms other Spark-based spatial analytics systems in most cases but does not support kNN joins, adding kNNJQ is a worthwhile challenge. Therefore, in this paper, we investigate how to design and implement an efficient kNNJQ algorithm in Sedona, using the most appropriate spatial partitioning technique and other improvements. Finally, the results of an extensive set of experiments with real-world datasets are presented, demonstrating that the proposed kNNJQ algorithm is efficient, scalable and robust in Sedona.
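To make the kNNJQ definition concrete, the following is a minimal single-machine sketch in Python. It is a naive nested-loop baseline that illustrates only what the query returns; it is not the partitioned algorithm proposed in the paper, and it does not use Sedona or Spark. The function name `knn_join`, the 2D point tuples and the Euclidean distance are assumptions made for illustration.

```python
import heapq
import math


def knn_join(P, Q, k):
    """Naive kNN join (illustrative only): pair every point of P
    with its k nearest points of Q under Euclidean distance."""
    result = []
    for p in P:
        # Scan all of Q and keep the k closest points, nearest first.
        neighbors = heapq.nsmallest(k, Q, key=lambda q: math.dist(p, q))
        result.append((p, neighbors))
    return result


# Hypothetical toy datasets, not taken from the paper's experiments.
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
pairs = knn_join(P, Q, k=2)
```

Each point of P triggers a full scan of Q, so the cost grows with |P| x |Q|. This is why the abstract describes kNNJQ as time-consuming, and why a distributed implementation depends on a suitable spatial partitioning technique.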

Research of all authors is supported by the MINECO research project [TIN2017-83964-R].


Notes

  1. Available at https://spark.apache.org/.

  2. Available at http://sedona.apache.org/.

  3. See http://sedona.apache.org/download/features/.

  4. Available at https://github.com/acgtic211/incubator-sedona/tree/KNNJ.

  5. Available at http://spatialhadoop.cs.umn.edu/datasets.html.

  6. Available at https://github.com/apache/incubator-sedona.



Author information


Corresponding author

Correspondence to Antonio Corral.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

García-García, F., Corral, A., Iribarne, L., Vassilakopoulos, M. (2021). Enhancing Sedona (formerly GeoSpark) with Efficient k Nearest Neighbor Join Processing. In: Attiogbé, C., Ben Yahia, S. (eds) Model and Data Engineering. MEDI 2021. Lecture Notes in Computer Science, vol 12732. Springer, Cham. https://doi.org/10.1007/978-3-030-78428-7_24


  • DOI: https://doi.org/10.1007/978-3-030-78428-7_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78427-0

  • Online ISBN: 978-3-030-78428-7

  • eBook Packages: Computer Science, Computer Science (R0)
