A Scalable Similarity Join Algorithm Based on MapReduce and LSH

International Journal of Parallel Programming

Abstract

Similarity joins are recognized as among the most useful data processing and analysis operations. A similarity join retrieves all data pairs whose distance is smaller than a predefined threshold \(\lambda\). In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized locality-sensitive hashing (LSH) key redistribution approach are used to balance load among processing nodes, while distributed histograms restrict communication and computation to almost only the relevant data. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large-scale systems, during all stages of similarity join computation. These results have been confirmed by a series of experiments using the Fréchet distance on large trajectory datasets from real-world and synthetic benchmarks.
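To make the join definition concrete, here is a minimal single-machine sketch of a threshold similarity join over trajectories. It is not the MRS-join algorithm: it compares every pair directly, with no MapReduce, LSH, or histograms. It also assumes the discrete Fréchet distance as a stand-in for the Fréchet distance named in the abstract, and it assumes a self-join over one dataset. The function names `discrete_frechet` and `naive_similarity_join` are hypothetical.

```python
from itertools import combinations
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences.

    Assumption: the discrete variant is used here for brevity; the paper
    may rely on the continuous Fréchet distance instead.
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the distance between the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def naive_similarity_join(trajectories, lam):
    """Return all pairs of trajectory ids whose distance is smaller than lam.

    Brute-force baseline: every pair is compared, which is exactly the
    quadratic cost that a distributed, hashing-based join tries to avoid.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(trajectories), 2)
        if discrete_frechet(trajectories[a], trajectories[b]) < lam
    ]


# Toy data (made up for illustration): t1 and t2 are parallel and close,
# t3 is far from both.
toy = {
    "t1": [(0, 0), (1, 0), (2, 0)],
    "t2": [(0, 1), (1, 1), (2, 1)],
    "t3": [(0, 5), (1, 5), (2, 5)],
}
close_pairs = naive_similarity_join(toy, lam=1.5)
```

With the toy data above, only the pair of parallel neighbouring trajectories falls under the threshold. The quadratic number of comparisons in this baseline is the cost that motivates filtering candidate pairs before computing exact distances.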

Notes

  1. ECML/PKDD Porto taxi data. https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i

  2. ECML/PKDD Porto taxi data. https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i

Author information

Corresponding author

Correspondence to Mostafa Bamha.

About this article

Cite this article

Rivault, S., Bamha, M., Limet, S. et al. A Scalable Similarity Join Algorithm Based on MapReduce and LSH. Int J Parallel Prog 50, 360–380 (2022). https://doi.org/10.1007/s10766-022-00733-6

  • DOI: https://doi.org/10.1007/s10766-022-00733-6
