
The role of transitive closure in evaluating blocking methods for dirty entity resolution

Journal of Intelligent Information Systems

Abstract

Entity resolution (ER) identifies duplicate records that refer to the same real-world entity and links them together, within one dataset or across several. As a first step toward reducing the number of required record comparisons, blocking methods group records that are likely to match. Because the choice of blocking method directly affects the final ER performance, blocking methods must be evaluated properly. The currently available metrics for evaluating blocking techniques assess only their actual potential. In dirty datasets, however, new matching pairs can be deduced from the identified ones through transitive closure between matching record pairs. The present study proposes a modification of the current metrics that takes both transitive closure and the potential of blocking methods into account, yielding a more accurate evaluation. A comparison of the existing and proposed metrics for ten available blocking algorithms on two dirty datasets demonstrates that the proposed metrics correlate significantly with final ER performance.
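The effect of transitive closure on a blocking metric can be illustrated with a small sketch. The code below is not the metric proposed in the article, whose definition is not given in the abstract. It only shows, under stated assumptions, how pairs deduced by transitivity can raise pair completeness (the share of true matching pairs that are covered). The record identifiers, the function names, and the assumption of a perfect matcher on the candidate pairs are all hypothetical.

```python
from itertools import combinations


def normalize(pair):
    """Order a pair so that (a, b) and (b, a) compare equal."""
    return tuple(sorted(pair))


def transitive_closure(pairs):
    """Return every pair implied by transitivity, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group records by their root, then emit all pairs inside each group.
    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), []).append(record)

    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


def pair_completeness(found_pairs, true_pairs):
    """Fraction of true matching pairs that appear among found_pairs."""
    found = {normalize(p) for p in found_pairs}
    truth = {normalize(p) for p in true_pairs}
    return len(found & truth) / len(truth)


# Hypothetical ground truth: r1, r2, r3 are one entity; r4, r5 another.
truth = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r4", "r5")]
# Hypothetical candidate pairs produced by some blocking method.
candidates = [("r1", "r2"), ("r3", "r2"), ("r1", "r4")]

# Assumption: a perfect matcher keeps exactly the candidates that are true matches.
truth_set = {normalize(p) for p in truth}
matched = [p for p in candidates if normalize(p) in truth_set]

pc_direct = pair_completeness(candidates, truth)
pc_closed = pair_completeness(transitive_closure(matched), truth)
print(pc_direct, pc_closed)
```

In this toy case the blocking method never proposes (r1, r3), yet that pair follows from (r1, r2) and (r2, r3), so coverage rises from two to three of the four true pairs once the closure is taken. The closure is applied only to matched pairs, since closing over non-matching candidates such as (r1, r4) would merge distinct entities.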

Author information

Corresponding author

Correspondence to Mahdi Niknam.

About this article

Cite this article

Niknam, M., Minaei-Bidgoli, B. & Dianat, R. The role of transitive closure in evaluating blocking methods for dirty entity resolution. J Intell Inf Syst 58, 561–590 (2022). https://doi.org/10.1007/s10844-021-00676-3

  • DOI: https://doi.org/10.1007/s10844-021-00676-3
