skip to main content
10.1145/1066157.1066168acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Reference reconciliation in complex information spaces

Published: 14 June 2005 Publication History

Abstract

Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop.Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidences. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.

References

[1]
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proc. of VLDB, 2002.
[2]
I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In DMKD, 2004.
[3]
M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In SIGKDD, 2003.
[4]
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems Special Issue on Information Integration on the Web, September 2003.
[5]
V. Bush. As we may think. The Atlantic Monthly, 1945.
[6]
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proc. of SIGMOD, 2003.
[7]
Computer and information science papers citeseer publications researchindex. http://citeseer.ist.psu.edu/.
[8]
W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration, 2002.
[9]
W. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In SIGKDD, 2000.
[10]
W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWEB, pages 73--78, 2003.
[11]
http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz.
[12]
A. Doan, Y. Lu, Y. Lee, and J. Han. Object matching for information integration: a profiler-based approach. In IIWeb, 2003.
[13]
X. Dong and A. Halevy. A Platform for Personal Information Management and Integration. In Proc. of CIDR, 2005.
[14]
X. Dong, A. Halevy, and J. Madhavan. Reference Reconciliation in Complex Information Spaces. Technical Report 2005-03-04, Univ. of Washington, 2005.
[15]
X. Dong, A. Halevy, E. Nemes, S. Sigurdsson, and P. Domingos. Semex: Toward on-the-fly personal information integration. In IIWeb, 2004.
[16]
S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff i've seen: A system for personal information retrieval and re-use. In SIGIR, 2003.
[17]
I. P. Fellegi and A. B. Sunter. A theory for record linkage. In Journal of the American Statistical Association, 1969.
[18]
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: language, model, and algorithms. In VLDB, pages 371--380, 2001.
[19]
Google. http://desktop.google.com/, 2004.
[20]
L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: current practice and future directions. http://www.act.cmis.csiro.au/rohanb/PAPERS/record.linkage.pdf.
[21]
M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995.
[22]
L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in Large Data Sets. In DASFAA, 2003.
[23]
D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SIAM Data Mining (SDM), 2005.
[24]
M. L. Lee, T. W. Ling, and W. L. Low. Intelliclean: a knowledge-based intelligent data cleaner. In SIGKDD, pages 290--294, 2000.
[25]
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 2000.
[26]
A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In IIWEB, 2003.
[27]
A. K. McCallum, K. Nigam, and L. H. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In SIGKDD, 2000.
[28]
M. Michalowski, S. Thakkar, and C. A. Knoblock. Exploiting secondary sources for unsupervised record linkage. In IIWeb, 2004.
[29]
H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. In Science 130 (1959), no. 3381, pages 954--959, 1959.
[30]
Parag and P. Domingos. Multi-relational record linkage. In MRDM, 2004.
[31]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002.
[32]
J. C. Pinheiro and D. X. Sun. Methods for linking and mining massive heterogeneous databases. In SIGKDD, 1998.
[33]
D. Quan, D. Huynh, and D. R. Karger. Haystack: A platform for authoring end user semantic web applications. In ISWC, 2003.
[34]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002.
[35]
S. Tejada, C. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In SIGKDD, 2002.
[36]
W. E. Winkler. Using the em algorithm for weight computation in the fellegi-sunter model of record linkage. In Section on Survey Research Methods, 1988.
[37]
W. E. Winkler. The state of record linkage and current research problems. Technical report, U. S. Bureau of the Census, Wachington, DC, 1999.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data
June 2005
990 pages
ISBN:1595930604
DOI:10.1145/1066157
  • Conference Chair:
  • Fatma Ozcan
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2005

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS05
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)36
  • Downloads (Last 6 weeks)2
Reflects downloads up to 16 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Enriching Relations with Additional Attributes for ERProceedings of the VLDB Endowment10.14778/3681954.368198717:11(3109-3123)Online publication date: 30-Aug-2024
  • (2024)A Review of Anonymization for Healthcare DataBig Data10.1089/big.2021.016912:6(538-555)Online publication date: 1-Dec-2024
  • (2023)Examining Knowledge Extraction Processes from Heterogeneous Data SourcesBrilliant Engineering10.36937/ben.2023.47984:1(1-8)Online publication date: 8-Feb-2023
  • (2023)Toward an Efficient and Effective Credit Scorer for Cross-Border E-Commerce EnterprisesScientific Programming10.1155/2023/52810502023Online publication date: 1-Jan-2023
  • (2023)Unsupervised Graph-Based Entity Resolution for Complex EntitiesACM Transactions on Knowledge Discovery from Data10.1145/353301617:1(1-30)Online publication date: 20-Feb-2023
  • (2023)A Semantic Corpus of Russian Literature of 18 Century: Its Current State and Its FutureLiterature, Language and Computing10.1007/978-981-99-3604-5_11(121-128)Online publication date: 14-Jul-2023
  • (2022)Coarse-to-Fine Entity Alignment for Chinese Heterogeneous Encyclopedia Knowledge BaseFuture Internet10.3390/fi1402003914:2(39)Online publication date: 25-Jan-2022
  • (2022)LACE: A Logical Approach to Collective Entity ResolutionProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526233(379-391)Online publication date: 12-Jun-2022
  • (2022)Statistical inference links data and theory in network scienceNature Communications10.1038/s41467-022-34267-913:1Online publication date: 10-Nov-2022
  • (2021)Record Linkage of Chinese Patent Inventors and Authors of Scientific ArticlesApplied Sciences10.3390/app1118841711:18(8417)Online publication date: 10-Sep-2021
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media