DOI: 10.1145/3183713.3196926
research-article

Deep Learning for Entity Matching: A Design Space Exploration

Published: 27 May 2018

ABSTRACT

Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.


      • Published in

        SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
        May 2018
        1874 pages
        ISBN: 9781450347037
        DOI: 10.1145/3183713

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        SIGMOD '18 paper acceptance rate: 90 of 461 submissions, 20%
        Overall acceptance rate: 785 of 4,003 submissions, 20%
