SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework

Du, Jinxiu; Nie, Tiezheng; Dou, Wenzhou; Shen, Derong; Kou, Yue

doi:10.1007/978-3-031-20309-1_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13579)

Included in the following conference series:

International Conference on Web Information Systems and Applications

Abstract

Entity matching, a key technique in data quality research, identifies records in different data sources that refer to the same real-world entity. This paper introduces SAREM, a semi-supervised entity matching framework for heterogeneous data. We first obtain effective feature vectors with an embedding approach that combines semantic and relational information and that also handles long sequences. Deep learning requires large amounts of labeled data, which are costly and time-consuming to obtain. We address this problem by using a dropout layer for data augmentation and by proposing an active learning method better suited to entity matching. We also address the classical challenges of deep active learning by reducing human intervention and improving model performance. Experiments on six public benchmark datasets show that our method outperforms DeepER and DeepMatcher on all of them. With a smaller amount of data, our method achieves effectiveness comparable to state-of-the-art (SOTA) entity matching methods, thereby reducing cost, and it outperforms SOTA entity matching methods on large datasets with long sequences.
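
The abstract names two ingredients, dropout used as data augmentation and active learning, without giving their details. The following is only a minimal sketch of how such ingredients are commonly realized, not SAREM's actual procedure: the function names `dropout_views` and `select_uncertain`, the dropout rate, and the choice of plain uncertainty sampling (querying the pairs whose predicted match probability is closest to 0.5) are all assumptions made for illustration.

```python
import numpy as np


def dropout_views(embedding, rate=0.1, n_views=2, seed=0):
    """Return n_views randomly masked copies of one embedding (hypothetical helper).

    Each copy zeroes every component independently with probability `rate`
    and rescales the survivors by 1 / (1 - rate) (inverted dropout), so the
    copies can serve as augmented variants of the same record pair.
    """
    rng = np.random.default_rng(seed)
    embedding = np.asarray(embedding, dtype=float)
    views = []
    for _ in range(n_views):
        keep = rng.random(embedding.shape) >= rate  # True = component kept
        views.append(embedding * keep / (1.0 - rate))
    return np.stack(views)


def select_uncertain(match_probs, budget):
    """Pick the `budget` candidate pairs the model is least sure about.

    This is generic uncertainty sampling, assumed here for illustration:
    pairs are ranked by the distance of their match probability from 0.5.
    """
    distance = np.abs(np.asarray(match_probs, dtype=float) - 0.5)
    return np.argsort(distance, kind="stable")[:budget]


if __name__ == "__main__":
    views = dropout_views(np.ones(8), rate=0.25, n_views=2)
    print(views.shape)
    # Hypothetical match probabilities for five unlabeled record pairs.
    probs = [0.95, 0.52, 0.10, 0.45, 0.70]
    print(select_uncertain(probs, budget=2))
```

In a full loop, the selected pairs would be labeled by a human annotator, added to the training set, and the matcher retrained before the next selection round.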

Acknowledgment

This work was supported by the National Natural Science Foundation of China (62072086, 62172082, 62072084) and the Fundamental Research Funds for the Central Universities (N2116008).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China
Jinxiu Du, Tiezheng Nie, Wenzhou Dou, Derong Shen & Yue Kou

Corresponding author

Correspondence to Jinxiu Du.

Editor information

Editors and Affiliations

National University of Defense Technology, Changsha, China
Xiang Zhao
Guangzhou University, Guangzhou, China
Shiyu Yang
Tianjin University, Tianjin, China
Xin Wang
Deakin University, Melbourne, VIC, Australia
Jianxin Li

About this paper

Cite this paper

Du, J., Nie, T., Dou, W., Shen, D., Kou, Y. (2022). SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_7

DOI: https://doi.org/10.1007/978-3-031-20309-1_7
Published: 08 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20308-4
Online ISBN: 978-3-031-20309-1
eBook Packages: Computer Science, Computer Science (R0)
