Abstract
Entity Resolution (ER), which aims to identify records from one or more datasets that refer to the same real-world entity, is one of the most important tasks in improving data quality. For textual datasets whose attribute values are long word sequences, traditional ER methods may fail to accurately capture the semantic information of records, leading to poor effectiveness. To address this challenging problem, we propose a novel entity resolution model, IGaBERT, which fine-tunes the pre-trained language model RoBERTa during training. In IGaBERT, interactive attention captures token-level differences between records and removes the requirement that the two records share an identical schema; global attention then determines the importance of these differences. Extensive experiments without injecting domain knowledge are conducted to measure the effectiveness of IGaBERT on both structured and textual datasets. The results indicate that IGaBERT significantly outperforms several state-of-the-art approaches on textual datasets, especially when training data are scarce, while remaining highly competitive on structured datasets.
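The two attention mechanisms named in the abstract can be illustrated with a minimal sketch. This is a hedged, simplified reconstruction, not the authors' IGaBERT implementation: the function names, dimensions, and the use of a single learned query vector for global attention are assumptions, and the real model operates on contextual RoBERTa token embeddings rather than raw vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_attention(rec_a, rec_b):
    """Cross-attend record A's token embeddings to record B's.

    rec_a: (m, d) token embeddings of record A
    rec_b: (n, d) token embeddings of record B
    Returns (m, d): each token of A minus its B-aware aligned summary,
    highlighting token-level differences. Because alignment is over
    tokens, the two records need not share an identical schema.
    """
    scores = rec_a @ rec_b.T / np.sqrt(rec_a.shape[1])  # (m, n) scaled similarity
    weights = softmax(scores, axis=1)                   # attend over B's tokens
    aligned = weights @ rec_b                           # (m, d) B-side summary per A-token
    return rec_a - aligned                              # token-level differences

def global_attention(diffs, query):
    """Pool token-level differences, weighting each by a learned
    global query vector `query` of shape (d,)."""
    weights = softmax(diffs @ query)                    # (m,) importance per difference
    return weights @ diffs                              # (d,) pooled comparison vector

# demo: two records with 5 and 7 tokens, embedding dimension 8
rng = np.random.default_rng(0)
rec_a = rng.normal(size=(5, 8))
rec_b = rng.normal(size=(7, 8))
diffs = interactive_attention(rec_a, rec_b)             # (5, 8)
pooled = global_attention(diffs, rng.normal(size=8))    # (8,)
```

In a full model, the pooled comparison vector would feed a binary match/non-match classifier that is trained jointly while RoBERTa's parameters are fine-tuned.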
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhu, L., Liu, H., Song, X., Wei, Y., Wang, Y. (2024). Entity Resolution Based on Pre-trained Language Models with Two Attentions. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2386-7
Online ISBN: 978-981-97-2387-4
eBook Packages: Computer Science; Computer Science (R0)