SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework

  • Conference paper
Web Information Systems and Applications (WISA 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13579)

Abstract

Entity matching, a key technique in data quality research, identifies records in different data sources that refer to the same real-world entity. This paper introduces SAREM, a semi-supervised entity matching framework for heterogeneous data. We first obtain effective feature vectors with an embedding approach that combines semantic and relational information and that can also be applied to long sequences. Deep learning requires large amounts of labeled data, which are costly and time-consuming to obtain. We address this problem by using a dropout layer for data augmentation and by proposing an active learning method better suited to entity matching. We also address the classical challenges of deep active learning by reducing human intervention and improving model performance. We experiment with six public benchmark datasets, and the results show that our method outperforms DeepER and DeepMatcher on all of them. With a smaller amount of labeled data, our method achieves effectiveness comparable to state-of-the-art (SOTA) entity matching methods, which reduces labeling cost, and it outperforms SOTA entity matching methods on large datasets with long sequences.
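
The abstract names two ingredients, dropout-based augmentation and active learning, without giving their details. The following is only a minimal Python sketch of how such ingredients are commonly realized, not the paper's actual procedure. The function names `dropout_view` and `split_pool`, the inverted-dropout rescaling, the closest-to-0.5 uncertainty rule, and the `confident` threshold of 0.95 are all assumptions made for illustration.

```python
import random


def dropout_view(vec, p=0.1, rng=None):
    """Return an augmented copy of an embedding vector (hypothetical helper).

    Each component is zeroed with probability p; survivors are rescaled by
    1 / (1 - p) (inverted dropout), so the expected value is unchanged.
    Requires 0 <= p < 1.
    """
    rng = rng or random.Random()
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def split_pool(probs, budget, confident=0.95):
    """Split unlabeled record pairs by predicted match probability.

    Assumed strategy, not taken from the paper:
      * the `budget` pairs whose probability is closest to 0.5 are sent
        to a human annotator (uncertainty sampling);
      * remaining pairs with probability >= confident or <= 1 - confident
        receive a pseudo-label (1 = match, 0 = non-match).
    Returns (indices_to_label, {index: pseudo_label}).
    """
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    to_label = by_uncertainty[:budget]
    chosen = set(to_label)

    pseudo = {}
    for i, p in enumerate(probs):
        if i in chosen:
            continue
        if p >= confident:
            pseudo[i] = 1
        elif p <= 1.0 - confident:
            pseudo[i] = 0
    return to_label, pseudo


if __name__ == "__main__":
    # Hypothetical match probabilities for six candidate record pairs.
    probs = [0.98, 0.52, 0.03, 0.40, 0.70, 0.01]
    to_label, pseudo = split_pool(probs, budget=2)
    print("ask a human about pairs:", to_label)
    print("pseudo-labeled pairs:", pseudo)
    print("augmented view:", dropout_view([0.2, 0.4, 0.6], p=0.5, rng=random.Random(0)))
```

In this sketch, the pairs the model is least sure about consume the labeling budget, while confidently scored pairs are reused without human effort, which is one plausible reading of "reducing human intervention".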

Acknowledgment

This work was supported by the National Natural Science Foundation of China (62072086, 62172082, 62072084) and the Fundamental Research Funds for the Central Universities (N2116008).

Author information

Corresponding author

Correspondence to Jinxiu Du.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Du, J., Nie, T., Dou, W., Shen, D., Kou, Y. (2022). SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_7

  • DOI: https://doi.org/10.1007/978-3-031-20309-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20308-4

  • Online ISBN: 978-3-031-20309-1

  • eBook Packages: Computer Science, Computer Science (R0)
