DOI: 10.1145/3627673.3679843

Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy

Published: 21 October 2024

Abstract

Deep entity resolution (ER) identifies matching entities across data sources using deep learning techniques. It involves two steps: a blocker that identifies potential matches to generate candidate pairs, and a matcher that accurately distinguishes matches from non-matches among these candidate pairs. Recent deep ER approaches utilize pretrained language models (PLMs) to extract similarity features for blocking and matching, achieving state-of-the-art performance. However, they often fail to balance the consensus and discrepancy between the blocker and matcher, emphasizing the consensus while neglecting the discrepancy. This paper proposes MutualER, a deep entity resolution framework that integrates and jointly trains the blocker and matcher, balancing both the consensus and discrepancy between them. Specifically, we first introduce a lightweight PLM in a siamese structure for the blocker and a heavier PLM in a cross structure or an autoregressive large language model (LLM) for the matcher. Two optimization techniques, Mutual Sample Selection (MSS) and Similarity Knowledge Transferring (SKT), are designed to jointly train the blocker and matcher. MSS enables the blocker and matcher to mutually select customized training samples for each other to maintain the discrepancy, while SKT allows them to share similarity knowledge to improve their blocking and matching capabilities respectively, maintaining the consensus. Extensive experiments on five datasets demonstrate that MutualER significantly outperforms existing PLM-based and LLM-based approaches, achieving leading performance in both effectiveness and efficiency.
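
For readers unfamiliar with the blocker/matcher split, the sketch below illustrates the pipeline shape the abstract describes: a siamese (bi-encoder) blocker that embeds each record once and retrieves candidates by similarity search, followed by a cross-style matcher that scores each candidate pair jointly. This is a minimal toy sketch, not the authors' implementation: hashed bag-of-tokens vectors stand in for PLM encoders, a small MLP stands in for the cross-encoder/LLM matcher, and the SKT- and MSS-style couplings shown are illustrative assumptions about how similarity knowledge sharing and mutual sample selection could be wired.

```python
# Schematic sketch of the blocker/matcher pipeline described in the abstract.
# NOT the authors' implementation: hashed bag-of-tokens vectors stand in for
# the siamese PLM blocker, a tiny MLP stands in for the cross-encoder/LLM
# matcher, and the SKT/MSS couplings below are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

def embed_record(text: str, dim: int = 64) -> torch.Tensor:
    """Toy stand-in for a lightweight PLM encoder (e.g. a distilled BERT)."""
    vec = torch.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return F.normalize(vec, dim=0)

records_a = ["apple iphone 12 64gb black", "sony wh-1000xm4 headphones"]
records_b = ["iphone 12 black 64 gb", "bose qc35 headphones", "sony xm4 wireless"]

# Blocking (siamese / bi-encoder): each record is embedded independently, so
# candidate generation reduces to a cheap top-k similarity search.
emb_a = torch.stack([embed_record(t) for t in records_a])
emb_b = torch.stack([embed_record(t) for t in records_b])
sims = emb_a @ emb_b.T                         # cosine sims (unit vectors)
topk = sims.topk(k=2, dim=1).indices           # top-k candidates per record
candidates = [(i, j.item()) for i in range(len(records_a)) for j in topk[i]]

# Matching (cross structure): each candidate pair is scored jointly. A heavy
# cross-encoder PLM or an autoregressive LLM would sit here in practice.
matcher = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 2))
pair_feats = torch.stack([torch.cat([emb_a[i], emb_b[j]]) for i, j in candidates])
match_prob = matcher(pair_feats).softmax(dim=-1)[:, 1]

# SKT-style coupling (assumed form): nudge the blocker's pair similarities
# toward the matcher's match probabilities so the two components share
# similarity knowledge; the paper transfers knowledge in both directions.
blocker_scores = torch.stack([sims[i, j] for i, j in candidates])
skt_loss = F.mse_loss(blocker_scores.sigmoid(), match_prob.detach())

# MSS-style selection (assumed form): keep only pairs the matcher labels
# confidently as extra training samples for the blocker, so each component
# curates customized data for the other.
confident = (match_prob > 0.8) | (match_prob < 0.2)
print(f"candidates={candidates}, skt_loss={skt_loss.item():.4f}, "
      f"selected_for_blocker={confident.tolist()}")
```

The bi-encoder/cross-encoder division of labor is what makes such a pipeline tractable: blocker embeddings can be precomputed and indexed for approximate nearest-neighbor search, so only the small candidate set ever reaches the expensive pairwise matcher.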


      Published In

      CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
      October 2024
      5705 pages
ISBN: 9798400704369
DOI: 10.1145/3627673

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. blocking
      2. data integration
      3. entity matching
      4. entity resolution
      5. large language models
      6. pretrained language models

      Qualifiers

      • Research-article

      Funding Sources

• National Natural Science Foundation of China
• National Key Research and Development Program of China

      Conference

      CIKM '24

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
