Reference reconciliation in complex information spaces

Authors:

Jayant Madhavan

SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Pages 85 - 96

https://doi.org/10.1145/1066157.1066168

Published: 14 June 2005

Abstract

Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed that references belong to a single class with a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes, and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop.

Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
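To make the three features concrete, the following is a minimal toy sketch in Python. It is not the authors' algorithm: the token-overlap similarity, the `threshold` and `assoc_bonus` values, the `reconcile` function, and the sample data are all assumptions chosen for illustration, and negative evidence is not modelled. It only shows the general shape of the idea: references of several classes are compared, an association shared through an already reconciled reference counts as extra evidence, merged references pool their attribute values, and the loop repeats until no further merge occurs.

```python
from itertools import combinations


def reconcile(refs, links, threshold=0.75, assoc_bonus=0.5):
    """Toy reconciliation (illustrative only).

    refs:  id -> (class_name, {attribute: text})
    links: list of (id, id) associations, e.g. (person, paper)
    """
    parent = {r: r for r in refs}
    # Each cluster root holds the pooled (attribute, token) pairs of its members.
    values = {
        r: {(attr, tok) for attr, text in attrs.items() for tok in text.lower().split()}
        for r, (_, attrs) in refs.items()
    }

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            x = parent[x]
        return x

    def neighbours(root):
        # Clusters associated with any member of the cluster `root`.
        out = set()
        for a, b in links:
            ra, rb = find(a), find(b)
            if ra == root:
                out.add(rb)
            if rb == root:
                out.add(ra)
        return out

    def score(x, y):
        # Overlap coefficient on attribute tokens, plus a bonus when both
        # clusters are associated with the same (already reconciled) cluster.
        smaller = min(len(values[x]), len(values[y]))
        attr_sim = len(values[x] & values[y]) / smaller if smaller else 0.0
        shared = neighbours(x) & neighbours(y)
        return attr_sim + (assoc_bonus if shared else 0.0)

    changed = True
    while changed:  # earlier merges can enable later ones, so iterate to a fixpoint
        changed = False
        for a, b in combinations(sorted(refs), 2):
            x, y = find(a), find(b)
            if x == y or refs[x][0] != refs[y][0]:
                continue  # already merged, or references of different classes
            if score(x, y) >= threshold:
                parent[y] = x
                values[x] |= values[y]  # enrichment: pool attribute values
                changed = True

    clusters = {}
    for r in refs:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical sample data: three paper references and four person references.
refs = {
    "paper1": ("Paper", {"title": "Reference Reconciliation"}),
    "paper2": ("Paper", {"title": "reference reconciliation"}),
    "paper3": ("Paper", {"title": "Data Fusion"}),
    "person1": ("Person", {"name": "Xin Dong"}),
    "person2": ("Person", {"name": "X. Dong"}),
    "person3": ("Person", {"name": "Alon Halevy"}),
    "person4": ("Person", {"name": "Luna Dong"}),
}
links = [
    ("person1", "paper1"),
    ("person2", "paper2"),
    ("person3", "paper1"),
    ("person4", "paper3"),
]

clusters = reconcile(refs, links)
print(clusters)
```

In this sketch the two "Dong" references with different first-name forms are merged only because their papers were reconciled first; the name overlap alone stays below the threshold, so "Luna Dong" remains separate, and so does a co-author who shares the paper but no name tokens.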

Published In

SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data

June 2005

990 pages

ISBN: 1595930604

DOI: 10.1145/1066157

Conference Chair:
Fatma Ozcan
IBM Almaden Research Center

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS05

Sponsor:

SIGMOD/PODS05: International Conference on Management of Data and Symposium on Principles of Database Systems

June 14 - 16, 2005

Baltimore, Maryland

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
