Abstract
Entity matching (EM), a fundamental task in data cleansing and integration, aims to identify the records in databases that refer to the same real-world entity. Although recent deep learning techniques have significantly improved EM performance, they are often constrained by large-scale noisy data and insufficient labeled examples. In this paper, we present a novel EM approach based on deep neural networks and adversarial active learning. Specifically, we design a deep EM model that automatically completes missing textual values and captures both the similarities and the differences between records. Because learning the massive number of parameters in the deep model incurs a high labeling cost, we propose an adversarial active learning framework, which uses active learning to collect a small number of “good” examples and adversarial learning to augment those examples for greater stability. Additionally, to handle large-scale databases, we present a dynamic blocking method that can be interactively tuned with the deep EM model. Our experiments on benchmark datasets demonstrate the superior accuracy of our approach and validate the effectiveness of all the proposed modules.
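To make the interplay between blocking and active learning concrete, the following is a minimal sketch and not the method of the paper. It assumes a simple token-based blocking step and an uncertainty-sampling selection rule, neither of which is specified in the abstract. A Jaccard token overlap stands in for the deep EM model's match probability, and the adversarial augmentation step is not shown. The names `token_blocking`, `jaccard`, and `select_uncertain` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Return candidate pairs of record ids that share at least one token.

    Assumption: plain token blocking, used here only as a stand-in for
    the dynamic blocking method mentioned in the abstract.
    """
    index = defaultdict(set)
    for rid, text in records.items():
        for tok in set(text.lower().split()):
            index[tok].add(rid)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def jaccard(a, b):
    """Token-level Jaccard similarity (a placeholder for a learned match score)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_uncertain(records, pairs, budget):
    """Pick the `budget` candidate pairs whose score is closest to 0.5.

    Assumption: uncertainty sampling as the notion of a "good" example.
    """
    scored = [(abs(jaccard(records[a], records[b]) - 0.5), (a, b)) for a, b in pairs]
    scored.sort()
    return [pair for _, pair in scored[:budget]]


# Toy records, invented for illustration only.
records = {
    "r1": "apple iphone 12 64gb black",
    "r2": "iphone 12 apple 64gb",
    "r3": "samsung galaxy s21",
    "r4": "apple ipad air",
}
candidates = token_blocking(records)
to_label = select_uncertain(records, candidates, budget=1)
print(sorted(candidates))
print(to_label)
```

In this toy run, blocking prunes every pair involving `r3`, and the selection step forwards only the most ambiguous remaining pair to a human labeler. In a real system the placeholder score would be replaced by the trained model's predicted match probability.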





Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (No. 61872172), the Alibaba Group through the Alibaba Research Fellowship Program, and the Beijing Academy of Artificial Intelligence (BAAI). Zhifeng Bao is supported in part by ARC DP220101434 and DP200102611.
Cite this article
Huang, J., Hu, W., Bao, Z. et al. Deep entity matching with adversarial active learning. The VLDB Journal 32, 229–255 (2023). https://doi.org/10.1007/s00778-022-00745-1
DOI: https://doi.org/10.1007/s00778-022-00745-1