Deep entity matching with adversarial active learning

  • Regular Paper
The VLDB Journal

Abstract

Entity matching (EM), a fundamental task in data cleansing and integration, aims to identify the records in databases that refer to the same real-world entity. Although recent deep learning techniques have significantly improved EM performance, they are often constrained by large-scale noisy data and insufficient labeled examples. In this paper, we present a novel EM approach based on deep neural networks and adversarial active learning. Specifically, we design a deep EM model that automatically completes missing textual values and captures both the similarities and the differences between records. Because learning the large number of parameters in the deep model incurs a high labeling cost, we propose an adversarial active learning framework, which uses active learning to collect a small number of “good” examples and adversarial learning to augment these examples for greater stability. Additionally, to handle large-scale databases, we present a dynamic blocking method that can be interactively tuned with the deep EM model. Our experiments on benchmark datasets demonstrate the superior accuracy of our approach and validate the effectiveness of all the proposed modules.
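To make the active learning idea concrete, the following is a minimal sketch of one query step, in which the record pairs the current matcher is least sure about are sent for labeling. It is not the approach of this paper: the abstract does not state the selection criterion, so uncertainty sampling is assumed here, and the matcher is a hypothetical stand-in (a logistic function of token-level Jaccard similarity with arbitrary `weight` and `bias` values) rather than the deep EM model. The function names and the example records are invented for illustration, and the adversarial augmentation step is not shown.

```python
import math


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_probability(pair, weight=8.0, bias=-4.0):
    """Hypothetical stand-in matcher (assumption, not the paper's model):
    a logistic function of the Jaccard similarity of the two records."""
    left, right = pair
    score = weight * jaccard(left, right) + bias
    return 1.0 / (1.0 + math.exp(-score))


def select_uncertain(pool, k):
    """Uncertainty sampling (assumed criterion): return the k unlabeled
    pairs whose predicted match probability is closest to 0.5."""
    return sorted(pool, key=lambda p: abs(match_probability(p) - 0.5))[:k]


# Invented unlabeled candidate pairs, for illustration only.
pool = [
    ("apple iphone 12 64gb", "apple iphone 12 64gb"),
    ("apple iphone 12 64gb", "apple iphone 12 128gb"),
    ("apple iphone 12 64gb", "samsung galaxy s21"),
    ("dell xps 13 laptop", "dell xps 13"),
]

# The two pairs to send to a human labeler in this round.
queries = select_uncertain(pool, 2)
for left, right in queries:
    print(f"label needed: {left!r} vs {right!r}")
```

In this toy pool, the identical pair and the clearly unrelated pair receive confident predictions and are skipped, while the near-duplicates are selected for labeling. A full loop would retrain the matcher on the newly labeled pairs and repeat the selection.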

Notes

  1. https://github.com/nju-websoft/DAEM.

  2. https://github.com/facebookresearch/faiss.

  3. https://fasttext.cc/docs/en/pretrained-vectors.html.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (No. 61872172), the Alibaba Group through the Alibaba Research Fellowship Program, and the Beijing Academy of Artificial Intelligence (BAAI). Zhifeng Bao is supported in part by ARC DP220101434 and DP200102611.

Author information

Corresponding author

Correspondence to Wei Hu.

About this article

Cite this article

Huang, J., Hu, W., Bao, Z. et al. Deep entity matching with adversarial active learning. The VLDB Journal 32, 229–255 (2023). https://doi.org/10.1007/s00778-022-00745-1

  • DOI: https://doi.org/10.1007/s00778-022-00745-1
