Abstract
Entity resolution (ER) is the process of identifying duplicate records that refer to the same real-world entity, within one or more datasets, and linking them together. As a first step toward reducing the number of required record comparisons, blocking methods attempt to group records that are likely to match. Properly evaluating blocking methods in order to select the best one directly affects the final ER performance. Currently available metrics for evaluating blocking techniques assess only their actual potential. However, in dirty datasets, new pairs can be deduced from the identified ones through the transitive closure of matching record pairs. The present study proposes a modification of the current metrics that takes into account both transitive closure and the potential of blocking methods, giving a more accurate evaluation of those methods. A comparison of the existing and proposed metrics for ten available blocking algorithms on two dirty datasets shows that the proposed metrics correlate significantly with final ER performance.
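To make the role of transitive closure concrete, the following minimal sketch compares a standard recall-style blocking measure (pair completeness) computed on the matching pairs found directly in blocks with the same measure after those pairs are expanded by transitive closure. It is an illustration only and not the metric definition proposed in the article: the toy blocks, the ground-truth pairs, the assumption of a perfect matcher, and the function names candidate_pairs, transitive_closure, and pair_completeness are all hypothetical.

```python
from itertools import combinations


def candidate_pairs(blocks):
    """Return all distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


def transitive_closure(pairs):
    """Expand matching pairs into every pair implied by transitivity.

    Uses a simple union-find structure: records linked by a chain of
    matches end up in the same cluster, and every pair inside a cluster
    is emitted.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), []).append(record)

    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


def pair_completeness(found_matches, true_matches):
    """Fraction of true matching pairs that were found."""
    return len(found_matches & true_matches) / len(true_matches)


# Hypothetical toy data: records 1, 2, 3 are one entity; 4 and 5 are another.
true_matches = {(1, 2), (1, 3), (2, 3), (4, 5)}
blocks = [{1, 2}, {2, 3, 4}, {5, 6}]

# Assumption: a perfect matcher confirms every true match inside a block.
direct = candidate_pairs(blocks) & true_matches
closed = transitive_closure(direct)

pc_direct = pair_completeness(direct, true_matches)
pc_closed = pair_completeness(closed, true_matches)
print(pc_direct, pc_closed)
```

In this toy case the pair (1, 3) never shares a block, yet it follows from (1, 2) and (2, 3), so a measure that only counts co-blocked pairs understates what the blocking output can deliver once matches are closed transitively.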
Cite this article
Niknam, M., Minaei-Bidgoli, B. & Dianat, R. The role of transitive closure in evaluating blocking methods for dirty entity resolution. J Intell Inf Syst 58, 561–590 (2022). https://doi.org/10.1007/s10844-021-00676-3
DOI: https://doi.org/10.1007/s10844-021-00676-3