Scaling Up Set Similarity Joins Using a Cost-Based Distributed-Parallel Framework

Fier, Fabian; Freytag, Johann-Christoph

doi:10.1007/978-3-030-89657-7_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 13058)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

The set similarity join (SSJ) is an important operation in data science. For example, it can relate data from different sources or detect plagiarism. Common SSJ approaches are based on the filter-and-verification framework. Existing approaches are sequential (single-core), multi-threaded, or distributed via MapReduce. The amount of data to be processed is large and keeps growing, and the SSJ is a compute-intensive operation. None of the existing SSJ methods scales to large datasets: single- and multi-core methods are limited by the available hardware, and MapReduce-based methods do not scale because their data replication is too high, skewed, or both. We propose a novel, highly scalable distributed SSJ approach that overcomes the limits and bottlenecks of existing parallel SSJ approaches. With a cost-based heuristic and a data-independent scaling mechanism, we avoid intra-node data replication and recomputation. A heuristic assigns similar shares of compute costs to each node. A RAM usage estimation prevents swapping, which is critical for the runtime. Our approach significantly scales up the SSJ execution and processes much larger datasets than all parallel approaches designed so far.
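
To make the filter-and-verification idea concrete, the following is a minimal single-core sketch of a self-join. It is not the distributed approach proposed in the paper. It assumes Jaccard similarity as the similarity function (the abstract does not name one), and the names `jaccard` and `ssj_self_join` are hypothetical. The filter step uses only a length filter: for sets with |a| <= |b|, the Jaccard similarity can be at most |a| / |b|, so pairs whose sizes differ too much are skipped without computing any intersection.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def ssj_self_join(records, threshold):
    """Return all index pairs (i, j), i < j, whose Jaccard similarity is >= threshold.

    Illustrative sketch only: a length filter followed by exact verification.
    """
    sets = [frozenset(r) for r in records]
    # Visit sets in order of increasing size so the length filter can
    # terminate the inner loop early.
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))
    result = []
    for pos, i in enumerate(order):
        a = sets[i]
        for j in order[pos + 1:]:
            b = sets[j]  # |b| >= |a| because of the sort order
            # Filter: Jaccard(a, b) <= |a| / |b|, so the pair cannot qualify
            # if |a| < threshold * |b|. All later sets are at least as large
            # as b, so the inner loop can stop here.
            if len(a) < threshold * len(b):
                break
            # Verification: compute the exact similarity of the candidate pair.
            if jaccard(a, b) >= threshold:
                result.append((min(i, j), max(i, j)))
    return sorted(result)


records = [["a", "b", "c"], ["a", "b", "c", "d"], ["x", "y"], ["a", "b"]]
pairs = ssj_self_join(records, 0.6)
```

With a threshold of 0.6, the pair of the first and second record (similarity 0.75) and the pair of the first and fourth record (similarity 2/3) qualify. Practical SSJ implementations typically add further filters, such as prefix filtering, before verification.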

Notes

1. Our implementation is available at https://github.com/fabiyon/dist-ssj-sisap.

Acknowledgements

This work was supported by a research grant from LexisNexis Risk Solutions.

Author information

Authors and Affiliations

Humboldt-Universität zu Berlin, Berlin, Germany
Fabian Fier & Johann-Christoph Freytag

Corresponding author

Correspondence to Fabian Fier.

Editor information

Editors and Affiliations

National University of San Luis, San Luis, Argentina
Nora Reyes
University of St Andrews, St Andrews, UK
Richard Connor
University of Vienna, Vienna, Austria
Nils Kriege
Kiel University, Kiel, Germany
Daniyal Kazempour
University of Bologna, Bologna, Italy
Ilaria Bartolini
TU Dortmund University, Dortmund, Germany
Erich Schubert
TU Dortmund University, Dortmund, Germany
Jian-Jia Chen

About this paper

Cite this paper

Fier, F., Freytag, J.-C. (2021). Scaling Up Set Similarity Joins Using a Cost-Based Distributed-Parallel Framework. In: Reyes, N., et al. (eds.) Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_2

DOI: https://doi.org/10.1007/978-3-030-89657-7_2
Published: 22 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89656-0
Online ISBN: 978-3-030-89657-7
eBook Packages: Computer Science, Computer Science (R0)
