Deep Learning for Entity Matching: A Design Space Exploration

Authors:

Sidharth Mudgal,

Theodoros Rekatsinas,

Youngchoon Park,

Ganesh Krishnan,

Esteban Arcaute,

Vijay Raghavendra

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 19 - 34

https://doi.org/10.1145/3183713.3196926

Published: 27 May 2018

Abstract

Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking and textual entailment). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that DL deserves serious consideration for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
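To make the simplest of the four solutions concrete, the following is a minimal sketch of SIF-style matching, not the paper's implementation. It assumes that each attribute value is summarized as a frequency-weighted average of word embeddings, using the smooth inverse frequency weight a / (a + p(w)), and that the two summaries are then compared attribute by attribute. The tiny embedding table, the record fields, and all function names are hypothetical. The principal-component removal step of full SIF and the downstream match/no-match classifier are omitted.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional word embeddings, for illustration only.
EMBEDDINGS = {
    "apple": [1.0, 0.0],
    "iphone": [0.9, 0.1],
    "samsung": [0.0, 1.0],
    "galaxy": [0.1, 0.9],
    "the": [0.5, 0.5],
}


def sif_weights(corpus_tokens, a=1e-3):
    """Map each word to a / (a + p(word)); frequent words get small weights."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def summarize(tokens, embeddings, weights):
    """Weighted average of the embeddings of the known tokens in one value."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    known = [t for t in tokens if t in embeddings]
    for token in known:
        w = weights.get(token, 1.0)  # unseen word: p(word) = 0, so weight is 1
        for i, x in enumerate(embeddings[token]):
            vec[i] += w * x
    return [v / len(known) for v in vec] if known else vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute_similarities(rec_a, rec_b, embeddings, weights):
    """One similarity score per attribute that both records share."""
    sims = {}
    for attr in sorted(set(rec_a) & set(rec_b)):
        vec_a = summarize(rec_a[attr].lower().split(), embeddings, weights)
        vec_b = summarize(rec_b[attr].lower().split(), embeddings, weights)
        sims[attr] = cosine(vec_a, vec_b)
    return sims


if __name__ == "__main__":
    weights = sif_weights(["the", "the", "the", "apple", "iphone"])
    left = {"title": "the Apple iPhone"}
    right = {"title": "Samsung Galaxy"}
    print(attribute_similarities(left, right, EMBEDDINGS, weights))
```

In a full matcher, the per-attribute scores would be passed to a trained classifier that decides whether the two records match. The sketch does not attempt to represent the RNN, Attention, or Hybrid solutions.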

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN: 9781450347037

DOI: 10.1145/3183713

General Chairs:
Gautam Das (University of Texas at Arlington, USA),
Christopher Jermaine (Rice University, USA),
Philip Bernstein (Microsoft Research, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 paper acceptance rate: 90 of 461 submissions (20%)

Overall acceptance rate: 785 of 4,003 submissions (20%)

Bibliometrics

Total citations: 339
Total downloads: 3,957
Downloads (last 12 months): 399
Downloads (last 6 weeks): 43

Reflects downloads up to 07 Mar 2025
