SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII

Abstract

A critical task in data cleaning and integration is identifying duplicate records that represent the same real-world entity. Similarity joins are widely used to detect pairs of similar records, in combination with a subsequent clustering algorithm that groups together records referring to the same entity. Unfortunately, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance and means that final results are produced only at the end of the whole process. Motivated by this observation, in this article we propose and experimentally evaluate SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity task carried out by the join algorithm. An optimization task is further applied on top of this framework. Results from an extensive experimental campaign show that SjClust outperforms previous approaches by an order of magnitude in most settings.
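To make the two-step baseline described in the abstract concrete, the following is a minimal sketch of a similarity join followed by clustering as a separate post-processing step. It is not the SjClust method. The Jaccard similarity, the threshold name `tau`, the naive all-pairs join, and clustering by connected components are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_join(records, tau):
    """Naive self-join: index pairs whose similarity is at least tau."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= tau
    ]


def cluster_pairs(n, pairs):
    """Post-processing step: group records into connected components."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical records, each already tokenized into a set.
records = [
    {"data", "cleaning", "join"},
    {"data", "cleaning", "joins"},
    {"set", "similarity", "join"},
    {"entity", "resolution"},
]
pairs = similarity_join(records, tau=0.5)      # step 1: join
clusters = cluster_pairs(len(records), pairs)  # step 2: clustering
print(clusters)
```

In this sketch no cluster is available until the join has produced every pair, which is the limitation that motivates integrating the two steps into a single operation.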

Notes

  1. For ease of notation, the parameter \(\tau\) is omitted.

  2. A secondary ordering is used to break ties consistently (e.g., the lexicographic ordering).

  3. http://dblab.cs.toronto.edu/project/stringer/clustering/.

  4. http://www.cs.utexas.edu/users/ml/riddle/data/dbgen.tar.gz.

  5. http://dblab.cs.toronto.edu/project/stringer/datasets/sample.htm.


Acknowledgments

This research was partially supported by the Brazilian agencies CNPq and CAPES.

Author information

Correspondence to Leonardo Andrade Ribeiro.

Copyright information

© 2018 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B. (2018). SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms. In: Hameurlain, A., Wagner, R., Hartmann, S., Ma, H. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII. Lecture Notes in Computer Science, vol 11250. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58384-5_4

  • DOI: https://doi.org/10.1007/978-3-662-58384-5_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-58383-8

  • Online ISBN: 978-3-662-58384-5

  • eBook Packages: Computer Science, Computer Science (R0)
