DOI: 10.1145/3627673.3679843

Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy

Published: 21 October 2024

Abstract

Deep entity resolution (ER) identifies matching entities across data sources using deep learning techniques. It involves two steps: a blocker that identifies potential matches to generate candidate pairs, and a matcher that accurately distinguishes matches from non-matches among these candidate pairs. Recent deep ER approaches utilize pretrained language models (PLMs) to extract similarity features for blocking and matching, achieving state-of-the-art performance. However, they often fail to balance the consensus and discrepancy between the blocker and matcher, emphasizing the consensus while neglecting the discrepancy. This paper proposes MutualER, a deep entity resolution framework that integrates and jointly trains the blocker and matcher, balancing both the consensus and discrepancy between them. Specifically, we first introduce a lightweight PLM in a siamese structure for the blocker and a heavier PLM in a cross structure or an autoregressive large language model (LLM) for the matcher. Two optimization techniques, Mutual Sample Selection (MSS) and Similarity Knowledge Transferring (SKT), are designed to jointly train the blocker and matcher. MSS enables the blocker and matcher to mutually select customized training samples for each other to maintain the discrepancy, while SKT allows them to share similarity knowledge to improve their blocking and matching capabilities respectively, maintaining the consensus. Extensive experiments on five datasets demonstrate that MutualER significantly outperforms existing PLM-based and LLM-based approaches, achieving leading performance in both effectiveness and efficiency.
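
For readers unfamiliar with the blocker/matcher split, the sketch below illustrates the pipeline shape the abstract describes: a siamese (bi-encoder) blocker that embeds each record once and retrieves candidates by similarity search, followed by a cross-style matcher that scores each candidate pair jointly. This is a minimal toy sketch, not the authors' implementation: hashed bag-of-tokens vectors stand in for PLM encoders, a small MLP stands in for the cross-encoder/LLM matcher, and the SKT- and MSS-style couplings shown are illustrative assumptions about how similarity knowledge sharing and mutual sample selection could be wired.

```python
# Schematic sketch of the blocker/matcher pipeline described in the abstract.
# NOT the authors' implementation: hashed bag-of-tokens vectors stand in for
# the siamese PLM blocker, a tiny MLP stands in for the cross-encoder/LLM
# matcher, and the SKT/MSS couplings below are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

def embed_record(text: str, dim: int = 64) -> torch.Tensor:
    """Toy stand-in for a lightweight PLM encoder (e.g. a distilled BERT)."""
    vec = torch.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return F.normalize(vec, dim=0)

records_a = ["apple iphone 12 64gb black", "sony wh-1000xm4 headphones"]
records_b = ["iphone 12 black 64 gb", "bose qc35 headphones", "sony xm4 wireless"]

# Blocking (siamese / bi-encoder): each record is embedded independently, so
# candidate generation reduces to a cheap top-k similarity search.
emb_a = torch.stack([embed_record(t) for t in records_a])
emb_b = torch.stack([embed_record(t) for t in records_b])
sims = emb_a @ emb_b.T                         # cosine sims (unit vectors)
topk = sims.topk(k=2, dim=1).indices           # top-k candidates per record
candidates = [(i, j.item()) for i in range(len(records_a)) for j in topk[i]]

# Matching (cross structure): each candidate pair is scored jointly. A heavy
# cross-encoder PLM or an autoregressive LLM would sit here in practice.
matcher = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 2))
pair_feats = torch.stack([torch.cat([emb_a[i], emb_b[j]]) for i, j in candidates])
match_prob = matcher(pair_feats).softmax(dim=-1)[:, 1]

# SKT-style coupling (assumed form): nudge the blocker's pair similarities
# toward the matcher's match probabilities so the two components share
# similarity knowledge; the paper transfers knowledge in both directions.
blocker_scores = torch.stack([sims[i, j] for i, j in candidates])
skt_loss = F.mse_loss(blocker_scores.sigmoid(), match_prob.detach())

# MSS-style selection (assumed form): keep only pairs the matcher labels
# confidently as extra training samples for the blocker, so each component
# curates customized data for the other.
confident = (match_prob > 0.8) | (match_prob < 0.2)
print(f"candidates={candidates}, skt_loss={skt_loss.item():.4f}, "
      f"selected_for_blocker={confident.tolist()}")
```

The bi-encoder/cross-encoder division of labor is what makes such a pipeline tractable: blocker embeddings can be precomputed and indexed for approximate nearest-neighbor search, so only the small candidate set ever reaches the expensive pairwise matcher.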


      Published In

      CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
      October 2024
      5705 pages
ISBN: 9798400704369
DOI: 10.1145/3627673

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. blocking
      2. data integration
      3. entity matching
      4. entity resolution
      5. large language models
      6. pretrained language models

      Qualifiers

      • Research-article

      Funding Sources

• National Natural Science Foundation of China
• National Key Research and Development Program of China

      Conference

      CIKM '24

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
