Entity Resolution

Synonyms

Co-reference resolution; Deduplication; Duplicate detection; Identity uncertainty; Merge-purge; Object consolidation; Record linkage; Reference reconciliation

Definition

A fundamental problem in data cleaning and integration (see Data Preparation) is dealing with uncertain and imprecise references to real-world entities. The goal of entity resolution is to take a collection of uncertain entity references (or references, in short) from a single data source or multiple data sources, discover the unique set of underlying entities, and map each reference to its corresponding entity. This typically involves two subproblems: identification, which maps references with different attributes to the same entity, and disambiguation, which assigns references with identical attributes to different entities.
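As a rough illustration of the mapping from references to entities, the following Python sketch groups name strings whose pairwise similarity exceeds a threshold and assigns one entity id per group. It is a minimal sketch under stated assumptions, not a method taken from this entry: the function names `similarity` and `resolve`, the use of the standard-library `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all illustrative choices. It covers only the identification subproblem; references with identical attributes always land in the same entity, so disambiguation would need additional evidence such as relationships between references.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(references: list[str], threshold: float = 0.8) -> list[int]:
    """Return one entity id per reference, in input order.

    References are linked when their similarity meets the (assumed)
    threshold; linked references are merged transitively with union-find.
    """
    parent = list(range(len(references)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of references; merge the clusters of similar pairs.
    for i, j in combinations(range(len(references)), 2):
        if similarity(references[i], references[j]) >= threshold:
            parent[find(j)] = find(i)

    # Number the clusters in order of first appearance.
    entity_ids: dict[int, int] = {}
    return [entity_ids.setdefault(find(i), len(entity_ids))
            for i in range(len(references))]


if __name__ == "__main__":
    refs = ["Jon Smith", "John Smith", "Jane Doe"]
    print(resolve(refs))  # the first two references share an entity id
```

Comparing all pairs is quadratic in the number of references, so a sketch like this is only practical for small collections.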

Motivation and Background

Entity resolution is a common problem that comes up in different guises (and is given different names) in many computer science domains. Examples include computer...

Recommended Reading

  • Bhattacharya, I., & Getoor, L. (2006). A latent Dirichlet model for unsupervised entity resolution. In The SIAM international conference on data mining (SIAM-SDM), Bethesda, MD, USA.

  • Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM transactions on knowledge discovery from data, 1(1), 5.

  • Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2003), Washington, DC.

  • Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD international conference on management of data (pp. 313–324). San Diego, CA.

  • Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003). A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 workshop on information integration on the web (pp. 73–78). Acapulco, Mexico.

  • Dong, X., Halevy, A., & Madhavan, J. (2005). Reference reconciliation in complex information spaces. In The ACM international conference on management of data (SIGMOD), Baltimore, MD, USA.

  • Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.

  • Gravano, L., Ipeirotis, P., Koudas, N., & Srivastava, D. (2003). Text joins for data cleansing and integration in an RDBMS. In 19th IEEE international conference on data engineering.

  • Hernández, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD international conference on management of data (SIGMOD-95) (pp. 127–138). San Jose, CA.

  • Kalashnikov, D. V., Mehrotra, S., & Chen, Z. (2005). Exploiting relationships for domain-independent data cleaning. In SIAM international conference on data mining (SIAM SDM), April 21–23 2005, Newport Beach, CA, USA.

  • Li, X., Morie, P., & Roth, D. (2005). Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special issue on semantic integration, 26(1).

  • McCallum, A., & Wellner, B. (2004). Conditional models of identity uncertainty with application to noun coreference. In NIPS, Vancouver, BC.

  • Menestrina, D., Benjelloun, O., & Garcia-Molina, H. (2006). Generic entity resolution with data confidences. In First Int’l VLDB workshop on clean databases, Seoul, Korea.

  • Monge, A. E., & Elkan, C. P. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 workshop on research issues on data mining and knowledge discovery (pp. 23–29). Tucson, AZ.

  • Pasula, H., Marthi, B., Milch, B., Russell, S., & Shpitser, I. (2003). Identity uncertainty and citation matching. In Advances in neural information processing systems 15. Cambridge, MA: MIT Press.

  • Singla, P., & Domingos, P. (2004). Multi-relational record linkage. In Proceedings of 3rd workshop on multi-relational data mining at ACM SIGKDD, Seattle, WA.

  • Winkler, W. E. (2002). Methods for record linkage and Bayesian networks. Technical Report, Statistical Research Division, U.S. Census Bureau, Washington, DC.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this entry

Cite this entry

Bhattacharya, I., Getoor, L. (2011). Entity Resolution. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_254
