Entity Resolution

Bhattacharya, Indrajit; Getoor, Lise

doi:10.1007/978-0-387-30164-8_254

Synonyms

Co-reference resolution; Deduplication; Duplicate detection; Identity uncertainty; Merge-purge; Object consolidation; Record linkage; Reference reconciliation

Definition

A fundamental problem in data cleaning and integration (see Data Preparation) is dealing with uncertain and imprecise references to real-world entities. The goal of entity resolution is to take a collection of uncertain entity references (or references, in short) from a single data source or multiple data sources, discover the unique set of underlying entities, and map each reference to its corresponding entity. This typically involves two subproblems: identification of references with different attributes that refer to the same entity, and disambiguation of references with identical attributes by assigning them to different entities.
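
The mapping from references to entities can be illustrated with a small sketch. It is not the authors' method: it assumes references are plain name strings, uses Python's standard difflib string similarity with an arbitrary threshold of 0.8, and merges matches transitively with union-find. The names `similarity` and `resolve` and the sample data are hypothetical. The sketch covers only the identification subproblem; disambiguating references with identical attributes would need additional context (for example, relationships between references), which this code does not model.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(references: list[str], threshold: float = 0.8) -> list[int]:
    """Map each reference to an entity id by merging similar references.

    Pairwise matches are closed transitively with a union-find structure.
    The threshold is an assumption, not a recommended value.
    """
    parent = list(range(len(references)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of references (quadratic; fine for a small example).
    for i in range(len(references)):
        for j in range(i + 1, len(references)):
            if similarity(references[i], references[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as consecutive entity ids in order of first appearance.
    ids: dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(references))]


# Hypothetical references: three spellings of one name, plus a second name.
refs = ["Lise Getoor", "lise getoor", "Lise Getor", "Indrajit Bhattacharya"]
entity_ids = resolve(refs)
print(entity_ids)  # [0, 0, 0, 1]
```

Comparing all pairs is quadratic in the number of references, so the sketch is only suitable for small collections, and string similarity alone would wrongly merge distinct entities that happen to share a name.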

Motivation and Background

Entity resolution is a common problem that comes up in different guises (and is given different names) in many computer science domains. Examples include computer...

Editor information

Claude Sammut, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, 2052
Geoffrey I. Webb, Faculty of Information Technology, Clayton School of Information Technology, Monash University, P.O. Box 63, Victoria, Australia, 3800

Cite this entry

Bhattacharya, I., Getoor, L. (2011). Entity Resolution. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_254

DOI: https://doi.org/10.1007/978-0-387-30164-8_254
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
