Entity Resolution: Overview and Challenges

Garcia-Molina, Hector

doi:10.1007/978-3-540-30464-7_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3288)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

Entity resolution is a problem that arises in many information integration scenarios: we have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us which records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may contain differing information, e.g., one record may have the address misspelled, while another may be missing some fields. An entity resolution algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity) and merges the matching records as best it can. Entity resolution algorithms typically rely on user-defined functions that (a) compare fields or records to determine if they match (are likely to represent the same real-world entity), and (b) merge matching records into one, perhaps combining fields in the process (e.g., creating a new name based on two slightly different versions of the name).
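
As an illustration of the two kinds of user-defined functions mentioned above, the following is a minimal sketch, not the SERF project's own code. The record layout (dictionaries of string fields), the field names, the 0.8 similarity threshold, and the "keep the longer value" merge policy are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy string comparison; the 0.8 threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def match(r1: dict, r2: dict) -> bool:
    """Hypothetical match rule: names must be similar, and phone numbers
    must agree whenever both records have one."""
    n1, n2 = r1.get("name"), r2.get("name")
    if not n1 or not n2 or not similar(n1, n2):
        return False
    p1, p2 = r1.get("phone"), r2.get("phone")
    return p1 is None or p2 is None or p1 == p2


def merge(r1: dict, r2: dict) -> dict:
    """Hypothetical merge policy: for each field, keep the longer of the
    available string values, so missing fields are filled in."""
    merged = {}
    for key in sorted(set(r1) | set(r2)):
        candidates = [v for v in (r1.get(key), r2.get(key)) if v]
        merged[key] = max(candidates, key=len) if candidates else None
    return merged


# Two records for the same customer: one with a misspelled address,
# the other with a shortened name but an extra field.
a = {"name": "John Smith", "address": "12 Main Stret"}
b = {"name": "Jon Smith", "address": "12 Main Street", "phone": "555-0100"}

if match(a, b):
    print(merge(a, b))
```

Other merge policies (for example, keeping every observed value of a field) fit the same two-function interface; which one is appropriate depends on the application.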

In this talk I will give an overview of the Stanford SERF Project, which is building a framework to describe and evaluate entity resolution schemes. In particular, I will give an overview of some of the different entity resolution settings:

De-duplication versus fidelity enhancement. In the de-duplication problem, we have a single set of records, and we try to merge the ones representing the same real-world entity. In the fidelity enhancement problem, we have two sets of records: a base set of records of interest, and a new set of acquired information. The goal is to coalesce the new information into the base records.
Clustering versus snapping. With snapping, we examine records pair-wise and decide if they represent the same entity. If they do, we merge the records into one, and continue the process of pair-wise comparisons. With clustering, we analyze all records and partition them into groups we believe represent the same real-world entity. At the end, each partition is merged into one record.
Confidences. In some entity resolution scenarios we must manage confidences. For example, input records may have a confidence value representing how likely it is that they are true. Snap rules (which tell us when two records match) may also have confidences representing how likely it is that two records actually represent the same real-world entity. As we merge records, we must track their confidences.
Schema mismatches. In some entity resolution scenarios we must deal not just with resolving information on entities, but also with resolving discrepancies among the schemas of the different sources. For example, the attribute names and formats from one source may not match those of other sources.
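
The contrast between snapping and clustering can be sketched as follows. This is only an illustration under stated assumptions, not the algorithms from the talk: records are assumed to be dictionaries of sets, two records are assumed to match when they share an e-mail address, merging is assumed to be a set union, and the clustering variant assumes that partitions are the connected components of the pair-wise match relation. The function names `snap` and `cluster` are hypothetical.

```python
from functools import reduce


def match(r1: dict, r2: dict) -> bool:
    # Assumed match rule: the records share at least one e-mail address.
    return bool(r1["emails"] & r2["emails"])


def merge(r1: dict, r2: dict) -> dict:
    # Assumed merge rule: union the values of every field.
    return {"names": r1["names"] | r2["names"],
            "emails": r1["emails"] | r2["emails"]}


def snap(records, match, merge):
    """Snapping: compare pair-wise, merge on a match, and keep comparing
    the merged record against the rest."""
    pending, resolved = list(records), []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            resolved.remove(partner)
            pending.append(merge(current, partner))  # re-examine the merged record
    return resolved


def cluster(records, match, merge):
    """Clustering: partition all records first (here via union-find over
    the match relation), then merge each partition into one record."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match(records[i], records[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return [reduce(merge, group) for group in groups.values()]


records = [
    {"names": {"J. Smith"}, "emails": {"js@example.com"}},
    {"names": {"John Smith"}, "emails": {"js@example.com", "john@example.org"}},
    {"names": {"Johnny"}, "emails": {"john@example.org"}},
    {"names": {"Mary"}, "emails": {"mary@example.com"}},
]

snapped = snap(records, match, merge)
clustered = cluster(records, match, merge)
```

In this toy data the first and third records do not match each other directly; they end up together only because both match the second record. With these particular match and merge rules the two strategies produce the same groups, which need not hold for other rules.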

Author information

Authors and Affiliations

Stanford University, Stanford, CA, USA
Hector Garcia-Molina

Editor information

Editors and Affiliations

Informatica e Automazione, Università Roma Tre, Via Vasca Navale 79, 00146, Roma, Italy
Paolo Atzeni
Computer Science Department, University of California, 3731 Boelter Hall, 90095, Los Angeles, CA, USA
Wesley Chu
Department of Computer Science, Tsinghua University, 100084, Beijing, P.R. China
Hongjun Lu
Department of Computer Science and Engineering, Fudan University, 200433, China
Shuigeng Zhou
School of Computing, National University of Singapore
Tok-Wang Ling

About this paper

Cite this paper

Garcia-Molina, H. (2004). Entity Resolution: Overview and Challenges. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, TW. (eds) Conceptual Modeling – ER 2004. ER 2004. Lecture Notes in Computer Science, vol 3288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30464-7_1

DOI: https://doi.org/10.1007/978-3-540-30464-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23723-5
Online ISBN: 978-3-540-30464-7
eBook Packages: Springer Book Archive
