Towards Long-Text Entity Resolution with Chain-of-Thought Knowledge Augmentation from Large Language Models

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14854)

Abstract

Entity resolution is a critical problem in data integration. Recently, approaches based on pre-trained language models have shown leading performance and have become the mainstream solution. When entities carry long-text descriptions, existing approaches cope with the limited input context length of language models by filtering the descriptions syntactically, e.g., with TF-IDF or an auxiliary model, to select what is fed into the matcher. However, such naive filtering does not interact with the matching phase, so it may drop information that is key to computing semantic similarities and degrade the final matching quality. To address long-text entity resolution, we propose a novel framework called CoTer, which follows a chunk-then-aggregate architecture. CoTer first chunks the long-text descriptions and feeds them into the encoder to obtain chunked representations. It then implicitly highlights the semantically key information in these representations by injecting Chain-of-Thought reasoning knowledge from a Large Language Model. Finally, CoTer fuses the chunked representations and the reasoning knowledge in the decoder to output matching probabilities. Extensive experiments show that CoTer achieves leading performance compared with state-of-the-art solutions.
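The chunk-then-aggregate idea described in the abstract can be sketched in plain Python. Everything below is a hypothetical illustration, not the authors' implementation: the hash-based `embed` function stands in for CoTer's encoder, the `cot_knowledge` string stands in for the Chain-of-Thought reasoning text an LLM would produce, and the softmax weighting is one simple way reasoning knowledge could highlight semantically key chunks before aggregation.

```python
import math
import re
import zlib

def chunk_text(text, chunk_size=16):
    """Split a long description into fixed-size token chunks."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

def embed(text, dim=64):
    """Toy bag-of-words hash embedding (stand-in for the real encoder)."""
    vec = [0.0] * dim
    for tok in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def match_probability(desc_a, desc_b, cot_knowledge):
    """Chunk both descriptions, weight each chunk by its similarity to the
    LLM's reasoning text, aggregate, and score the pair."""
    cot_vec = embed(cot_knowledge)

    def aggregate(desc):
        chunks = [embed(c) for c in chunk_text(desc)]
        # Softmax weights: chunks closer to the reasoning knowledge count more.
        weights = [math.exp(dot(c, cot_vec)) for c in chunks]
        total = sum(weights)
        return [sum(w * c[i] for w, c in zip(weights, chunks)) / total
                for i in range(len(cot_vec))]

    sim = dot(aggregate(desc_a), aggregate(desc_b))
    return 1.0 / (1.0 + math.exp(-8.0 * (sim - 0.5)))  # squash into (0, 1)

cot = "focus on brand model and sensor resolution"
a = "Canon EOS R5 mirrorless camera 45MP full frame"
print(match_probability(a, a, cot))  # high (about 0.98) for identical descriptions
```

The key design point mirrored here is that chunk filtering is soft and similarity-driven rather than a hard syntactic cut: every chunk contributes to the aggregate, but chunks aligned with the reasoning knowledge dominate it.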

Notes

  1. https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md

Acknowledgement

This work is supported by the National Natural Science Foundation of China (62172082, 62072084, 62072086) and the Fundamental Research Funds for the Central Universities (N2116008).

Author information

Corresponding author

Correspondence to Derong Shen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tang, J., Dou, W., Shen, D., Nie, T., Kou, Y. (2024). Towards Long-Text Entity Resolution with Chain-of-Thought Knowledge Augmentation from Large Language Models. In: Onizuka, M., et al. Database Systems for Advanced Applications. DASFAA 2024. Lecture Notes in Computer Science, vol 14854. Springer, Singapore. https://doi.org/10.1007/978-981-97-5569-1_20

  • DOI: https://doi.org/10.1007/978-981-97-5569-1_20

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5568-4

  • Online ISBN: 978-981-97-5569-1

  • eBook Packages: Computer Science, Computer Science (R0)
