Deep entity matching with adversarial active learning

Huang, Jiacheng; Hu, Wei; Bao, Zhifeng; Chen, Qijin; Qu, Yuzhong

doi:10.1007/s00778-022-00745-1

Deep entity matching with adversarial active learning

Regular Paper
Published: 28 April 2022

Volume 32, pages 229–255, (2023)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Jiacheng Huang¹,
Wei Hu ORCID: orcid.org/0000-0003-3635-6335¹,
Zhifeng Bao²,
Qijin Chen³ &
…
Yuzhong Qu¹

923 Accesses
2 Citations
Explore all metrics

Abstract

Entity matching (EM), as a fundamental task in data cleansing and integration, aims to identify the data records in databases that refer to the same real-world entity. While recent deep learning technologies significantly improve the performance of EM, they are often restrained by large-scale noisy data and insufficient labeled examples. In this paper, we present a novel EM approach based on deep neural networks and adversarial active learning. Specifically, we design a deep EM model to automatically complete missing textual values and capture both similarity and difference between records. Given that learning massive parameters in the deep model needs expensive labeling cost, we propose an adversarial active learning framework, which leverages active learning to collect a small amount of “good” examples and adversarial learning to augment the examples for stability enhancement. Additionally, to deal with large-scale databases, we present a dynamic blocking method that can be interactively tuned with the deep EM model. Our experiments on benchmark datasets demonstrate the superior accuracy of our approach and validate the effectiveness of all the proposed modules.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework

Mixed Hierarchical Networks for Deep Entity Matching

Article 30 July 2021

Chen-Chen Sun & De-Rong Shen

When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary?

Notes

References

Allam, A., Skiadopoulos, S., Kalnis, P.: Improved suffix blocking for record linkage and entity resolution. Data Knowl. Eng. 117, 98–113 (2018)
Article Google Scholar
Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages. In: SIGMOD, pp. 783–794. ACM (2010)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. Proc. VLDB Endow. 4(11), 695–701 (2011)
Article Google Scholar
Berrendorf, M., Faerman, E., Melnychuk, V., Tresp, V., Seidl, T.: Knowledge graph entity alignment with graph convolutional networks: lessons learned. In: ECIR, pp. 3–11. Springer (2020)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Article Google Scholar
Brunner, U., Stockinger, K.: Entity matching with Transformer architectures: a step forward in data integration. In: EDBT, pp. 463–473. OpenProceedings.org (2020)
Chai, C., Li, G., Li, J., Deng, D., Feng, J.: A partial-order-based framework for cost-effective crowdsourced entity resolution. VLDB J. 27(6), 745–770 (2018)
Article Google Scholar
Das, S., G.C., P.S., Doan, A., Naughton, J.F., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V., Park, Y.: Falcon: scaling up hands-off crowdsourced entity matching to build cloud services. In: SIGMOD, pp. 1431–1446. ACM (2017)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186. ACL (2019)
Ebraheem, M., Thirumuruganathan, S., Joty, S.R., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. Proc. VLDB Endow. 11(11), 1454–1467 (2018)
Article Google Scholar
Fadaee, M., Bisazza, A., Monz, C.: Data augmentation for low-resource neural machine translation. In: ACL, pp. 567–573. ACL (2017)
Firmani, D., Saha, B., Srivastava, D.: Online entity resolution using an oracle. Proc. VLDB Endow. 9(5), 384–395 (2016)
Article Google Scholar
Gal, Y., Islam, R., Ghahramani, Z.: Deep Bayesian active learning with image data. In: ICML, pp. 1183–1192. PMLR (2017)
Getoor, L., Machanavajjhala, A.: Entity resolution: Tutorial. http://users.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf (2012)
Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: hands-off crowdsourcing for entity matching. In: SIGMOD, pp. 601–612. ACM (2014)
Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
Article Google Scholar
Govind, Y., Konda, P., C., P.S.G., Martinkus, P., Nagarajan, P., Li, H., Soundararajan, A., Mudgal, S., Ballard, J.R., Zhang, H., Ardalan, A., Das, S., Paulsen, D., Saini, A.S., Paulson, E., Park, Y., Carter, M., Sun, M., Fung, G.M., Doan, A.: Entity matching meets data science: a progress report from the Magellan project. In: SIGMOD, pp. 389–403. ACM (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Article Google Scholar
Huang, S.J., Jin, R., Zhou, Z.H.: Active learning by querying informative and representative examples. In: NIPS, pp. 892–900. Curran Associates Inc. (2010)
Jain, A., Sarawagi, S., Sen, P.: Deep indexed active learning for matching heterogeneous entity representations. Proc. VLDB Endow. 15(1), 31–45 (2021)
Article Google Scholar
Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. pp. 1–21 (2021)
Kasai, J., Qian, K., Gurajada, S., Li, Y., Popa, L.: Low-resource deep entity resolution with transfer and active learning. In: ACL, pp. 5851–5861. ACL (2019)
Kenig, B., Gal, A.: MFIBlocks: an effective blocking algorithm for entity resolution. Inf. Syst. 38(6), 908–926 (2013)
Article Google Scholar
Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP, pp. 1746–1751. ACL (2014)
Kumar, P., Gupta, A.: Active learning query strategies for classification, regression, and clustering: a survey. J. Comput. Sci. Technol. 35(4), 913–945 (2020)
Article Google Scholar
Li, B., Liu, Y., Zhang, A., Wang, W., Wan, S.: A survey on blocking technology of entity resolution. J. Comput. Sci. Technol. 35(4), 769–793 (2020)
Article Google Scholar
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.: Deep entity matching with pre-trained language models. Proc. VLDB Endow. 14(1), 50–60 (2020)
Article Google Scholar
Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: ACL, pp. 1064–1074. ACL (2016)
Ma, Y., Tran, T.: TYPiMatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration. In: WSDM, pp. 325–334. ACM (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119. Curran Associates Inc. (2013)
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: a design space exploration. In: SIGMOD, pp. 19–34. ACM (2018)
Nie, H., Han, X., He, B., Sun, L., Chen, B., Zhang, W., Wu, S., Kong, H.: Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In: CIKM, pp. 629–638. ACM (2019)
Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C., Nejdl, W.: A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 25(12), 2665–2682 (2012)
Article Google Scholar
Papadakis, G., Mandilaras, G., Gagliardelli, L., Simonini, G., Thanos, E., Giannakopoulos, G., Bergamaschi, S., Palpanas, T., Koubarakis, M.: Three-dimensional entity resolution with JedAI. Inf. Syst. 93, 101565 (2020)
Article Google Scholar
Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 9(9), 684–695 (2016)
Article Google Scholar
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543. ACL (2014)
Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: HoloClean: holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10(11), 1190–1201 (2017)
Article Google Scholar
Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. In: ICLR (2018)
Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers (2012)
Sun, Z., Zhang, Q., Hu, W., Wang, C., Chen, M., Akrami, F., Li, C.: A benchmarking study of embedding-based entity alignment for knowledge graphs. Proc. VLDB Endow. 13(11), 2326–2340 (2020)
Article Google Scholar
Tao, Y.: Entity matching with active monotone classification. In: PODS, pp. 49–62. ACM (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS, pp. 5998–6008. Curran Associates Inc. (2017)
Wang, Z., Sisman, B., Wei, H., Dong, X.L., Ji, S.: CorDEL: a contrastive deep learning approach for entity linkage. In: ICDM, pp. 1322–1327. IEEE (2020)
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: NIPS, pp. 7333–7343. Curran Associates Inc. (2019)
Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: AutoBlock: a hands-off blocking framework for entity matching. In: WSDM, pp. 744–752. ACM (2020)
Zhao, C., He, Y.: Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In: WWW, pp. 2413–2424. ACM (2019)
Zhao, X., Zeng, W., Tang, J., Wang, W., Suchanek, F.: An experimental study of state-of-the-art entity alignment approaches. IEEE Trans. Knowl. Data Eng., Early Access (2020)
Zhuang, Y., Li, G., Zhong, Z., Feng, J.: Hike: a hybrid human-machine method for entity alignment in large-scale knowledge bases. In: CIKM, pp. 1917–1926. ACM (2017)

Download references

Acknowledgements

This work is supported in part by National Natural Science Foundation of China (No. 61872172), Alibaba Group through Alibaba Research Fellowship Program, and Beijing Academy of Artificial Intelligence (BAAI). Zhifeng Bao is supported in part by ARC DP220101434 and DP200102611.

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Jiacheng Huang, Wei Hu & Yuzhong Qu
RMIT University, Melbourne, Australia
Zhifeng Bao
Alibaba Group, Hangzhou, China
Qijin Chen

Authors

Jiacheng Huang
View author publications
You can also search for this author in PubMed Google Scholar
Wei Hu
View author publications
You can also search for this author in PubMed Google Scholar
Zhifeng Bao
View author publications
You can also search for this author in PubMed Google Scholar
Qijin Chen
View author publications
You can also search for this author in PubMed Google Scholar
Yuzhong Qu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wei Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Huang, J., Hu, W., Bao, Z. et al. Deep entity matching with adversarial active learning. The VLDB Journal 32, 229–255 (2023). https://doi.org/10.1007/s00778-022-00745-1

Download citation

Received: 14 July 2021
Revised: 24 December 2021
Accepted: 26 March 2022
Published: 28 April 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s00778-022-00745-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Deep entity matching with adversarial active learning

Abstract

Access this article

Similar content being viewed by others

SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework

Mixed Hierarchical Networks for Deep Entity Matching

When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary?

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Deep entity matching with adversarial active learning

Abstract

Access this article

Similar content being viewed by others

SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework

Mixed Hierarchical Networks for Deep Entity Matching

When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary?

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation