DOI: 10.1145/1935826.1935903
poster

Efficient entity resolution for large heterogeneous information spaces

Published: 09 February 2011

Abstract

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and on schemata known a priori. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
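
The abstract names the ingredients but gives no algorithmic detail, so the following Python sketch is only one plausible reading and should not be taken as the authors' method. It assumes that attribute-agnostic blocking means one block per token found in any attribute value, that smaller blocks have higher expected utility, that match propagation can be modeled with a union-find structure, and that preemption is a fixed comparison budget. The names `tokens`, `build_blocks`, `resolve`, `threshold`, `max_comparisons`, and the `demo` data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def tokens(entity):
    """Lowercase word tokens of all attribute values; attribute names are ignored."""
    text = " ".join(str(value) for value in entity.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_blocks(entities):
    """Assumed attribute-agnostic blocking: one block per token, holding entity ids."""
    blocks = defaultdict(set)
    for eid, entity in entities.items():
        for tok in tokens(entity):
            blocks[tok].add(eid)
    # A block with a single entity produces no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def jaccard(a, b):
    """Jaccard similarity of two token sets (a stand-in for any match function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(entities, threshold=0.5, max_comparisons=1000):
    """Return (clusters, comparisons_used) for a dict of id -> attribute dict."""
    toks = {eid: tokens(entity) for eid, entity in entities.items()}
    blocks = build_blocks(entities)
    parent = {eid: eid for eid in entities}  # union-find forest over entity ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def clusters():
        groups = defaultdict(list)
        for eid in entities:
            groups[find(eid)].append(eid)
        return sorted(sorted(group) for group in groups.values())

    comparisons = 0
    # Assumed utility heuristic: process small blocks first (ties broken by token).
    for tok in sorted(blocks, key=lambda t: (len(blocks[t]), t)):
        for a, b in combinations(sorted(blocks[tok]), 2):
            if find(a) == find(b):
                continue  # propagation: this pair is already known to match
            if comparisons >= max_comparisons:
                return clusters(), comparisons  # preemption: budget exhausted
            comparisons += 1
            if jaccard(toks[a], toks[b]) >= threshold:
                parent[find(a)] = find(b)
    return clusters(), comparisons


# Hypothetical records with different schemata describing overlapping entities.
demo = {
    "e1": {"name": "John Smith", "city": "Berlin"},
    "e2": {"fullName": "john smith", "location": "berlin"},
    "e3": {"title": "Jane Doe", "town": "Paris"},
}
```

On the `demo` records, `e1` and `e2` share three token blocks although their attribute names differ. The first comparison merges them, and the remaining two blocks are skipped through propagation, so `resolve(demo)` uses one comparison where exhaustive pairwise matching of three records would need three.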

      Published In

      WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
      February 2011
      870 pages
      ISBN:9781450304931
      DOI:10.1145/1935826

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. attribute-agnostic blocking
      2. data cleaning
      3. entity resolution

      Qualifiers

      • Poster

      Acceptance Rates

      WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
      Overall Acceptance Rate: 498 of 2,863 submissions, 17%
