The role of transitive closure in evaluating blocking methods for dirty entity resolution

Niknam, Mahdi; Minaei-Bidgoli, Behrouz; Dianat, Rouhollah

doi:10.1007/s10844-021-00676-3

Published: 19 October 2021

Volume 58, pages 561–590, (2022)

Journal of Intelligent Information Systems

Mahdi Niknam¹,
Behrouz Minaei-Bidgoli² &
Rouhollah Dianat¹

Abstract

Entity resolution (ER) is the process of identifying duplicate records that refer to the same real-world entity and linking them together, within one dataset or across several. As a first step, blocking methods group records that are likely to match, which reduces the number of record comparisons required. Because the choice of blocking method directly affects final ER performance, blocking methods must be evaluated properly before the best one is selected. The metrics currently available for evaluating blocking techniques assess only their actual potential. In dirty datasets, however, new pairs can be deduced from the identified ones through transitive closure over matching record pairs. This study proposes a modification of the current metrics that takes into account both transitive closure and the potential of blocking methods, yielding a more accurate evaluation. A comparison of the existing and proposed metrics for ten available blocking algorithms on two dirty datasets shows that the proposed metrics correlate significantly with final ER performance.
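The idea of deducing new pairs through transitive closure can be illustrated with a small sketch. It is not the metric defined in the article, whose exact formulas are not part of this preview. It assumes a perfect matcher (every true match among the candidate pairs is recognised), uses the standard pairs completeness measure (share of true matches covered), and the function names `transitive_closure`, `pairs_completeness`, and `closure_aware_completeness` are hypothetical.

```python
from itertools import combinations


def normalise(pairs):
    """Store each pair as a sorted tuple so that (a, b) equals (b, a)."""
    return {tuple(sorted(p)) for p in pairs}


def transitive_closure(pairs):
    """Return every pair implied by `pairs` under transitivity (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group records by their root, then emit all pairs inside each group.
    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), []).append(record)
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


def pairs_completeness(candidates, true_matches):
    """Share of true matches that appear directly among the candidate pairs."""
    truth = normalise(true_matches)
    return len(normalise(candidates) & truth) / len(truth)


def closure_aware_completeness(candidates, true_matches):
    """Share of true matches reachable once matched pairs are closed.

    Assumption: a perfect matcher, so the matched pairs are exactly the
    true matches that the blocking method placed among its candidates.
    """
    truth = normalise(true_matches)
    matched = normalise(candidates) & truth
    return len(transitive_closure(matched) & truth) / len(truth)


# Toy dirty dataset: records 1, 2, 3 are one entity; records 4, 5 another.
true_matches = [(1, 2), (1, 3), (2, 3), (4, 5)]
# Hypothetical blocking output: misses (1, 3) and (4, 5), adds two non-matches.
candidates = [(1, 2), (2, 3), (4, 6), (7, 8)]

plain = pairs_completeness(candidates, true_matches)
closed = closure_aware_completeness(candidates, true_matches)
print(plain, closed)  # prints 0.5 0.75
```

In this toy case the blocking method never proposes the pair (1, 3), yet that pair follows from (1, 2) and (2, 3), so a metric that ignores transitive closure understates what the method contributes to the final result.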

Author information

Authors and Affiliations

Faculty of Computer Engineering, University of Qom, Qom, Iran
Mahdi Niknam & Rouhollah Dianat
Faculty of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Behrouz Minaei-Bidgoli

Corresponding author

Correspondence to Mahdi Niknam.

About this article

Cite this article

Niknam, M., Minaei-Bidgoli, B. & Dianat, R. The role of transitive closure in evaluating blocking methods for dirty entity resolution. J Intell Inf Syst 58, 561–590 (2022). https://doi.org/10.1007/s10844-021-00676-3

Received: 18 February 2021
Revised: 20 July 2021
Accepted: 05 September 2021
Published: 19 October 2021
Issue Date: June 2022
DOI: https://doi.org/10.1007/s10844-021-00676-3
