poster

Efficient entity resolution for large heterogeneous information spaces

Authors:

George Papadakis,

Ekaterini Ioannou,

Claudia Niederée,

Peter Fankhauser

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

Pages 535 - 544

https://doi.org/10.1145/1935826.1935903

Published: 09 February 2011

Abstract

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data impose new challenges to entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
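To make the ideas in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes that "attribute-agnostic" blocking can be approximated by treating every token of every attribute value as a block key, that small blocks are a reasonable stand-in for "high expected utility", that match propagation can be modeled with a union-find structure, and that preemption is a fixed comparison budget. The function names (`token_blocks`, `resolve`), the Jaccard similarity, the `threshold`, and the `budget` parameter are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(profile):
    """Lower-cased word tokens from all attribute values (names are ignored)."""
    return {t for v in profile.values() for t in re.findall(r"\w+", str(v).lower())}


def token_blocks(entities):
    """Assumed blocking scheme: one block per token, regardless of attribute."""
    blocks = defaultdict(set)
    for eid, profile in entities.items():
        for tok in tokens(profile):
            blocks[tok].add(eid)
    # A block with a single entity produces no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def resolve(entities, threshold=0.5, budget=1000):
    """Compare entities block by block; return (clusters, comparisons made)."""
    parent = {eid: eid for eid in entities}  # union-find over entity ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Hypothetical utility proxy: process the smallest blocks first.
    ordered = sorted(token_blocks(entities).items(), key=lambda kv: (len(kv[1]), kv[0]))
    compared, comparisons = set(), 0
    for _, ids in ordered:
        if comparisons >= budget:  # preempt: stop once the budget is spent
            break
        for a, b in combinations(sorted(ids), 2):
            if comparisons >= budget:
                break
            # Propagate matches: skip pairs already linked or already compared.
            if find(a) == find(b) or (a, b) in compared:
                continue
            compared.add((a, b))
            comparisons += 1
            ta, tb = tokens(entities[a]), tokens(entities[b])
            if len(ta & tb) / len(ta | tb) >= threshold:  # Jaccard similarity
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for eid in entities:
        clusters[find(eid)].add(eid)
    return {frozenset(c) for c in clusters.values()}, comparisons


# Two profiles with different schemata describing the same person, plus one other.
entities = {
    "e1": {"name": "John Smith", "city": "Berlin"},
    "e2": {"fullName": "Smith John", "location": "Berlin"},
    "e3": {"title": "Alice Jones", "place": "Paris"},
}
clusters, n_comparisons = resolve(entities)
```

In this toy input, `e1` and `e2` share no attribute names, yet they land in the same token blocks and are merged after a single comparison; the remaining blocks that contain the same pair are skipped because the match is already known. A real system would need a more careful utility estimate and similarity measure than the placeholders assumed here.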

Index Terms

Efficient entity resolution for large heterogeneous information spaces
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

February 2011

870 pages

ISBN:9781450304931

DOI:10.1145/1935826

General Chair: Irwin King (CUHK, Hong Kong)

Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

WSDM'11

WSDM'11: Fourth ACM International Conference on Web Search and Data Mining

February 9 - 12, 2011

Hong Kong, China

Acceptance Rates

WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics

Total Citations: 59

Total Downloads: 603

Downloads (Last 12 months): 27

Downloads (Last 6 weeks): 2

Reflects downloads up to 01 Mar 2025

Citations

Cited By

Dou W, Shen D, Zhou X, Bai H, Kou Y, Nie T, Cui H, Yu G, Serra E, Spezzano F (2024). Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 508-518. DOI: 10.1145/3627673.3679843. Online publication date: 21-Oct-2024
https://dl.acm.org/doi/10.1145/3627673.3679843
Moslemi M, Balamurugan H, Milani M (2024). Evaluating Blocking Biases in Entity Matching. 2024 IEEE International Conference on Big Data (BigData), pages 64-73. DOI: 10.1109/BigData62323.2024.10825531. Online publication date: 15-Dec-2024
https://doi.org/10.1109/BigData62323.2024.10825531
Ravikanth M, Korra S, Mamidisetti G, Goutham M, Bhaskar T (2024). An efficient learning based approach for automatic record deduplication with benchmark datasets. Scientific Reports, 14:1. DOI: 10.1038/s41598-024-63242-1. Online publication date: 15-Jul-2024
https://doi.org/10.1038/s41598-024-63242-1
Eibeck A, Zhang S, Lim M, Kraft M (2024). A simple and efficient approach to unsupervised instance matching and its application to linked data of power plants. Journal of Web Semantics, 80 (100815). DOI: 10.1016/j.websem.2024.100815. Online publication date: Apr-2024
https://doi.org/10.1016/j.websem.2024.100815
Zeakis A, Papadakis G, Skoutas D, Koubarakis M (2024). An in-depth analysis of pre-trained embeddings for entity resolution. The VLDB Journal, 34:1. DOI: 10.1007/s00778-024-00879-4. Online publication date: 4-Dec-2024
https://doi.org/10.1007/s00778-024-00879-4
Obraczka D, Rahm E (2024). Comparing Symbolic and Embedding-Based Approaches for Relational Blocking. Knowledge Engineering and Knowledge Management, pages 155-173. DOI: 10.1007/978-3-031-77792-9_10. Online publication date: 25-Nov-2024
https://dl.acm.org/doi/10.1007/978-3-031-77792-9_10
Zeakis A, Papadakis G, Skoutas D, Koubarakis M (2023). Pre-Trained Embeddings for Entity Resolution: An Experimental Analysis. Proceedings of the VLDB Endowment, 16:9, pages 2225-2238. DOI: 10.14778/3598581.3598594. Online publication date: 1-May-2023
https://dl.acm.org/doi/10.14778/3598581.3598594
Azeroual O, Nikiforova A, Sha K (2023). Overlooked Aspects of Data Governance: Workflow Framework For Enterprise Data Deduplication. 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS), pages 65-73. DOI: 10.1109/ICCNS58795.2023.10193478. Online publication date: 19-Jun-2023
https://doi.org/10.1109/ICCNS58795.2023.10193478
Papadakis G, Efthymiou V, Thanos E, Hassanzadeh O, Christen P (2023). An analysis of one-to-one matching algorithms for entity resolution. The VLDB Journal, 32:6, pages 1369-1400. DOI: 10.1007/s00778-023-00791-3. Online publication date: 18-Apr-2023
https://doi.org/10.1007/s00778-023-00791-3
Backes T, Dietze S (2022). Lattice-based progressive author disambiguation. Information Systems, 109:C. DOI: 10.1016/j.is.2022.102056. Online publication date: 1-Nov-2022
https://dl.acm.org/doi/10.1016/j.is.2022.102056
