ABSTRACT
Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking and textual entailment). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.