SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms

Ribeiro, Leonardo Andrade; Cuzzocrea, Alfredo; Bezerra, Karen Aline Alves; do Nascimento, Ben Hur Bahia

doi:10.1007/978-3-662-58384-5_4

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 11250)

Abstract

A critical task in data cleaning and integration is identifying duplicate records that represent the same real-world entity. A similarity join is widely used to detect pairs of similar records, followed by a clustering algorithm that groups together the records referring to the same entity. Unfortunately, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance, and final results become available only at the end of the whole process. Motivated by this limitation, in this article we propose and experimentally evaluate SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity task carried out by the join algorithm. A further optimization is applied on top of this framework. Results from an extensive experimental campaign show that SjClust outperforms previous approaches by an order of magnitude in most settings.
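To make the starting point concrete, the following is a minimal sketch of the conventional two-step pipeline that the abstract describes as the baseline: a set similarity self-join followed by clustering as a separate post-processing step. It is not the SjClust algorithm itself. The function names, the choice of Jaccard similarity, the naive nested-loop join, the use of transitive closure as the clustering rule, and the sample records are all illustrative assumptions rather than details taken from the chapter.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; taken as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_self_join(records, threshold):
    """Naive self-join (assumed): index pairs whose Jaccard is >= threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]


def cluster_pairs(n, pairs):
    """Post-processing step (assumed): transitive closure via union-find."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


# Hypothetical records, each tokenized into a set.
records = [
    {"john", "smith", "ny"},
    {"john", "smith", "nyc"},
    {"jon", "smith", "ny"},
    {"mary", "jones", "la"},
]

# Step 1: the join must finish before step 2 can start.
pairs = similarity_self_join(records, threshold=0.5)
# Step 2: clustering runs only over the complete join result.
clusters = cluster_pairs(len(records), pairs)
print(pairs)     # [(0, 1), (0, 2)]
print(clusters)  # [[0, 1, 2], [3]]
```

In this sketch no cluster is available until the join has produced all of its pairs, which is the limitation the abstract attributes to the post-processing design.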

Notes

1. For ease of notation, the parameter \(\tau\) is omitted.
2. A secondary ordering is used to break ties consistently (e.g., the lexicographic ordering).
3. http://dblab.cs.toronto.edu/project/stringer/clustering/.
4. http://www.cs.utexas.edu/users/ml/riddle/data/dbgen.tar.gz.
5. http://dblab.cs.toronto.edu/project/stringer/datasets/sample.htm.

Acknowledgments

This research was partially supported by the Brazilian agencies CNPq and CAPES.

Author information

Authors and Affiliations

Instituto de Informática, Universidade Federal de Goiás, Goiânia, Goiás, Brazil
Leonardo Andrade Ribeiro
DIA Department, University of Trieste and ICAR-CNR, Trieste, Italy
Alfredo Cuzzocrea
Departamento de Ciência da Computação, Universidade Federal de Lavras, Lavras, Brazil
Karen Aline Alves Bezerra & Ben Hur Bahia do Nascimento

Corresponding author

Correspondence to Leonardo Andrade Ribeiro.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Roland Wagner
Clausthal University of Technology, Clausthal-Zellerfeld, Germany
Sven Hartmann
Victoria University of Wellington, Wellington, New Zealand
Hui Ma

About this chapter

Cite this chapter

Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B. (2018). SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms. In: Hameurlain, A., Wagner, R., Hartmann, S., Ma, H. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII. Lecture Notes in Computer Science, vol 11250. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58384-5_4

DOI: https://doi.org/10.1007/978-3-662-58384-5_4
Published: 22 November 2018
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58383-8
Online ISBN: 978-3-662-58384-5
eBook Packages: Computer Science, Computer Science (R0)