Abstract
Entity matching, a key technique in data quality research, identifies records in different data sources that refer to the same real-world entity. This paper introduces SAREM, a semi-supervised entity matching framework for heterogeneous data. We first obtain effective feature vectors with an embedding approach that combines semantic and relational information and can handle long sequences. Deep learning requires large amounts of labeled data, which are costly and time-consuming to obtain. We address this problem by using a dropout layer for data augmentation and by proposing an active learning method better suited to entity matching. We also address the classical challenges of deep active learning by reducing human intervention and improving model performance. Experiments on six public benchmark datasets show that our method outperforms DeepER and DeepMatcher on all of them. With a smaller amount of labeled data, our method achieves effectiveness comparable to SOTA entity matching methods, thereby reducing labeling cost, and it outperforms SOTA entity matching methods on large datasets with long sequences.
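As a rough illustration of the two ideas named in the abstract, the sketch below shows a generic dropout-based perturbation of a feature vector and a generic uncertainty-sampling step for active learning. It is not SAREM's implementation: the function names `dropout_view` and `select_uncertain`, the dropout rate, the example embedding, and the match probabilities are all hypothetical, and uncertainty sampling stands in here for whatever selection strategy the paper actually proposes.

```python
import random


def dropout_view(vector, rate=0.1, rng=None):
    """Return a perturbed copy of `vector` (hypothetical helper).

    Each component is zeroed with probability `rate`; surviving
    components are rescaled by 1 / (1 - rate) (inverted dropout).
    """
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def select_uncertain(probabilities, budget):
    """Return indices of the `budget` pairs closest to probability 0.5.

    Generic uncertainty sampling: pairs the model is least sure about
    are sent to a human annotator first.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Assumed toy data: one record-pair embedding and five predicted
# match probabilities for unlabeled candidate pairs.
rng = random.Random(0)
embedding = [0.2, -1.3, 0.7, 0.05]
views = [dropout_view(embedding, rate=0.25, rng=rng) for _ in range(2)]

match_probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = select_uncertain(match_probs, budget=2)
```

In this toy setup, the two `views` could serve as augmented variants of the same labeled pair, and `to_label` holds the indices of the candidate pairs an annotator would be asked about next.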
Acknowledgment
This work was supported by the National Natural Science Foundation of China (62072086, 62172082, 62072084) and the Fundamental Research Funds for the Central Universities (N2116008).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Du, J., Nie, T., Dou, W., Shen, D., Kou, Y. (2022). SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_7
DOI: https://doi.org/10.1007/978-3-031-20309-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20308-4
Online ISBN: 978-3-031-20309-1
eBook Packages: Computer Science, Computer Science (R0)