Parallelizing String Similarity Join Algorithms

Yao, Ling-Chih; Lim, Lipyeow

doi:10.1007/978-3-319-92013-9_27

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10837)

Included in the following conference series:

Australasian Database Conference

Abstract

A key operation in data cleaning and integration is the use of string similarity join (SSJ) algorithms to identify and remove duplicate or similar records within data sets. With the advent of big data, a natural question is how to parallelize SSJ algorithms. There is a large body of existing work on SSJ algorithms, and parallelizing each of them individually may not be feasible. In this paper, we propose a parallelization framework for string similarity joins that reuses existing SSJ algorithms. Our framework partitions the data using a variety of partitioning strategies and then executes the SSJ algorithms on the partitions in parallel. Some of the partitioning strategies that we investigate trade accuracy for speed. We implemented and validated our framework on several SSJ algorithms and data sets. Our experiments show that our framework yields significant speedup with little loss in accuracy.
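The partition-then-join idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `naive_ssj` stands in for an existing SSJ algorithm, `partition_by_first_char` is a hypothetical lossy partitioning strategy, and a thread pool stands in for whatever parallel runtime is actually used. The sketch only shows how joining within partitions can miss pairs that span partitions, which is the accuracy-for-speed trade-off the abstract mentions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def naive_ssj(strings, tau):
    """Stand-in for an existing SSJ algorithm (assumption, not the paper's):
    returns every pair of distinct strings within edit distance tau."""
    unique = sorted(set(strings))
    return {(a, b) for a, b in combinations(unique, 2)
            if edit_distance(a, b) <= tau}


def partition_by_first_char(strings, num_partitions):
    """Hypothetical lossy strategy: bucket strings by their first character.
    Similar strings that differ in the first character land in different
    buckets and are never compared."""
    parts = [[] for _ in range(num_partitions)]
    for s in strings:
        key = ord(s[0]) if s else 0
        parts[key % num_partitions].append(s)
    return parts


def parallel_ssj(strings, tau, num_partitions=4, join=naive_ssj):
    """Partition the input, run the unchanged join on each partition
    concurrently, and union the per-partition results."""
    parts = partition_by_first_char(strings, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(lambda part: join(part, tau), parts))
    return set().union(*results)


names = ["smith", "smyth", "smithe", "jones", "jonas", "zmith"]
exact = naive_ssj(names, tau=1)      # join over the whole data set
approx = parallel_ssj(names, tau=1)  # join within partitions only
recall = len(approx) / len(exact)    # fraction of true pairs recovered
```

In this toy input, "smith" and "zmith" fall into different partitions, so the partitioned join misses that pair while finding the rest. Because CPython threads do not run CPU-bound code in parallel, a real deployment would likely swap the thread pool for a process pool or a message-passing runtime; the structure of the sketch stays the same.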

Author information

Authors and Affiliations

University of Hawai‘i at Mānoa, Honolulu, HI, 96822, USA
Ling-Chih Yao & Lipyeow Lim

Corresponding author

Correspondence to Lipyeow Lim.

Editor information

Editors and Affiliations

ICT, Griffith University, Southport, Queensland, Australia
Junhu Wang
Nanyang Technological University, Singapore, Singapore
Gao Cong
Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Victoria, Australia
Jinjun Chen
The University of Melbourne, Melbourne, Victoria, Australia
Jianzhong Qi

About this paper

Cite this paper

Yao, LC., Lim, L. (2018). Parallelizing String Similarity Join Algorithms. In: Wang, J., Cong, G., Chen, J., Qi, J. (eds) Databases Theory and Applications. ADC 2018. Lecture Notes in Computer Science, vol 10837. Springer, Cham. https://doi.org/10.1007/978-3-319-92013-9_27

DOI: https://doi.org/10.1007/978-3-319-92013-9_27
Published: 18 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92012-2
Online ISBN: 978-3-319-92013-9
eBook Packages: Computer Science; Computer Science (R0)