Parallelizing String Similarity Join Algorithms

  • Conference paper
Databases Theory and Applications (ADC 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10837)

Abstract

A key operation in data cleaning and integration is the use of string similarity join (SSJ) algorithms to identify and remove duplicate or similar records within data sets. With the advent of big data, a natural question is how to parallelize SSJ algorithms. There is a large body of existing work on SSJ algorithms, and parallelizing each of them individually may not be feasible. In this paper, we propose a parallelization framework for string similarity joins that reuses existing SSJ algorithms. Our framework partitions the data using a variety of partitioning strategies and then executes the SSJ algorithms on the partitions in parallel. Some of the partitioning strategies that we investigate trade accuracy for speed. We implemented and validated our framework on several SSJ algorithms and data sets. Our experiments show that our framework yields significant speedup with little loss in accuracy.
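The partition-then-join idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `naive_ssj`, `partition_by_first_char`, and `parallel_ssj` are hypothetical, the all-pairs edit-distance join merely stands in for an existing SSJ algorithm, and grouping by first character is an assumed partitioning strategy chosen only because it visibly trades accuracy for speed (pairs that straddle two partitions are never compared).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def naive_ssj(strings, threshold):
    """Stand-in for an existing SSJ algorithm (assumption): all-pairs self-join."""
    return {
        (min(s, t), max(s, t))
        for s, t in combinations(strings, 2)
        if edit_distance(s, t) <= threshold
    }


def partition_by_first_char(strings):
    """Hypothetical lossy partitioning strategy: group strings by first character."""
    parts = {}
    for s in strings:
        parts.setdefault(s[:1], []).append(s)
    return list(parts.values())


def parallel_ssj(strings, threshold, partitioner=partition_by_first_char, workers=4):
    """Run the unchanged SSJ routine on every partition and union the results."""
    parts = partitioner(strings)
    # A thread pool keeps the sketch self-contained; CPU-bound joins in CPython
    # would need processes or separate machines to see a real speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda part: naive_ssj(part, threshold), parts)
    joined = set()
    for pairs in results:
        joined |= pairs
    return joined


data = ["kitten", "kittens", "sitten", "sitting", "sittin"]
exact = naive_ssj(data, threshold=1)          # join over the whole data set
approx = parallel_ssj(data, threshold=1)      # join within partitions only
recall = len(approx) / len(exact)
```

On this toy input the partitioned join finds three of the four pairs that the full join finds; the pair ("kitten", "sitten") is lost because its strings fall into different partitions. Swapping in a different `partitioner` changes that trade-off without touching the join routine.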

Author information

Corresponding author

Correspondence to Lipyeow Lim.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Yao, LC., Lim, L. (2018). Parallelizing String Similarity Join Algorithms. In: Wang, J., Cong, G., Chen, J., Qi, J. (eds) Databases Theory and Applications. ADC 2018. Lecture Notes in Computer Science, vol 10837. Springer, Cham. https://doi.org/10.1007/978-3-319-92013-9_27

  • DOI: https://doi.org/10.1007/978-3-319-92013-9_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92012-2

  • Online ISBN: 978-3-319-92013-9

  • eBook Packages: Computer Science, Computer Science (R0)
