Abstract
Entity Resolution (ER), which aims to identify records from one or more datasets that refer to the same real-world entity, is one of the most important tasks in improving data quality. For textual datasets whose attribute values are long word sequences, traditional ER methods may fail to accurately capture the semantic information of records, leading to poor effectiveness. To address this challenging problem, we propose a novel entity resolution model, IGaBERT, which fine-tunes the pre-trained language model RoBERTa during training. In IGaBERT, interactive attention captures token-level differences between records and removes the requirement that the two records share an identical schema; global attention then determines the importance of these differences. Extensive experiments without injecting domain knowledge are conducted to measure the effectiveness of IGaBERT on both structured and textual datasets. The results indicate that IGaBERT significantly outperforms several state-of-the-art approaches on textual datasets, especially when training data are scarce, while remaining highly competitive on structured datasets.
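The two attention mechanisms named in the abstract can be illustrated with a minimal sketch. This is a hedged, simplified reconstruction, not the authors' IGaBERT implementation: the function names, dimensions, and the use of a single learned query vector for global attention are assumptions, and the real model operates on contextual RoBERTa token embeddings rather than raw vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_attention(rec_a, rec_b):
    """Cross-attend record A's token embeddings to record B's.

    rec_a: (m, d) token embeddings of record A
    rec_b: (n, d) token embeddings of record B
    Returns (m, d): each token of A minus its B-aware aligned summary,
    highlighting token-level differences. Because alignment is over
    tokens, the two records need not share an identical schema.
    """
    scores = rec_a @ rec_b.T / np.sqrt(rec_a.shape[1])  # (m, n) scaled similarity
    weights = softmax(scores, axis=1)                   # attend over B's tokens
    aligned = weights @ rec_b                           # (m, d) B-side summary per A-token
    return rec_a - aligned                              # token-level differences

def global_attention(diffs, query):
    """Pool token-level differences, weighting each by a learned
    global query vector `query` of shape (d,)."""
    weights = softmax(diffs @ query)                    # (m,) importance per difference
    return weights @ diffs                              # (d,) pooled comparison vector

# demo: two records with 5 and 7 tokens, embedding dimension 8
rng = np.random.default_rng(0)
rec_a = rng.normal(size=(5, 8))
rec_b = rng.normal(size=(7, 8))
diffs = interactive_attention(rec_a, rec_b)             # (5, 8)
pooled = global_attention(diffs, rng.normal(size=8))    # (8,)
```

In a full model, the pooled comparison vector would feed a binary match/non-match classifier that is trained jointly while RoBERTa's parameters are fine-tuned.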
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhu, L., Liu, H., Song, X., Wei, Y., Wang, Y. (2024). Entity Resolution Based on Pre-trained Language Models with Two Attentions. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2386-7
Online ISBN: 978-981-97-2387-4
eBook Packages: Computer Science; Computer Science (R0)