
Entity Resolution Based on Pre-trained Language Models with Two Attentions

Conference paper in Web and Big Data (APWeb-WAIM 2023)

Abstract

Entity Resolution (ER) is one of the most important tasks for improving data quality; it aims to identify records, from one or more datasets, that refer to the same real-world entity. For textual datasets whose attribute values are long word sequences, traditional ER methods may fail to accurately capture the semantic information of records, leading to poor effectiveness. To address this challenging problem, we propose IGaBERT, a novel entity resolution model built on the pre-trained language model RoBERTa, which is fine-tuned during training. In IGaBERT, interactive attention captures token-level differences between records and removes the requirement that the two records share an identical schema; global attention then determines the importance of these differences. Extensive experiments without injecting domain knowledge were conducted to measure the effectiveness of IGaBERT over both structured and textual datasets. The results indicate that IGaBERT significantly outperforms several state-of-the-art approaches on textual datasets, especially with small amounts of training data, and is highly competitive with those approaches on structured datasets.
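The abstract's description of the two attentions is concrete enough to sketch in code. Below is a minimal, hypothetical PyTorch sketch: a RoBERTa encoder (fine-tuned end to end) produces token representations for each serialized record, an interactive (cross) attention derives token-level differences between the pair, and a global attention weights those differences before a match/non-match classifier. All module names, shapes, and pooling choices here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two-attention design described in the abstract.
# Not the authors' code: names, dimensions, and pooling are assumptions.
import torch
import torch.nn as nn
from transformers import RobertaModel

class IGaBERTSketch(nn.Module):
    def __init__(self, model_name="roberta-base", hidden=768, num_labels=2):
        super().__init__()
        # RoBERTa encoder, fine-tuned together with the attention layers.
        self.encoder = RobertaModel.from_pretrained(model_name)
        self.global_attn = nn.Linear(hidden, 1)      # scores each token-level difference
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        # Records are serialized as plain token sequences, so the two
        # records need not share an identical schema.
        h_a = self.encoder(input_ids=ids_a, attention_mask=mask_a).last_hidden_state
        h_b = self.encoder(input_ids=ids_b, attention_mask=mask_b).last_hidden_state

        # Interactive attention: every token of record A attends over record B.
        scores = torch.matmul(h_a, h_b.transpose(1, 2))            # (batch, len_a, len_b)
        scores = scores.masked_fill(mask_b.unsqueeze(1) == 0, -1e9)
        aligned_b = torch.softmax(scores, dim=-1) @ h_b            # B as seen from A's tokens

        # Token-level differences between A and its aligned view of B.
        diff = h_a - aligned_b                                     # (batch, len_a, hidden)

        # Global attention: learn how important each difference is, then pool.
        w = self.global_attn(diff).squeeze(-1)                     # (batch, len_a)
        w = w.masked_fill(mask_a == 0, -1e9)
        pooled = (torch.softmax(w, dim=-1).unsqueeze(-1) * diff).sum(dim=1)
        return self.classifier(pooled)                             # match / non-match logits
```

A symmetric variant would also compute differences from B's side and combine both pooled vectors; the sketch keeps one direction for brevity.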



Author information

Corresponding author

Correspondence to Yu Wang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhu, L., Liu, H., Song, X., Wei, Y., Wang, Y. (2024). Entity Resolution Based on Pre-trained Language Models with Two Attentions. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_29

  • DOI: https://doi.org/10.1007/978-981-97-2387-4_29

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2386-7

  • Online ISBN: 978-981-97-2387-4
