Efficient entity resolution based on subgraph cohesion

Wang, Hongzhi; Li, Jianzhong; Gao, Hong

doi:10.1007/s10115-015-0818-7

Regular Paper
Published: 14 January 2015

Volume 46, pages 285–314, (2016)

Knowledge and Information Systems

Hongzhi Wang¹,
Jianzhong Li¹ &
Hong Gao¹

Abstract

Entity resolution has wide applications and has received considerable attention in the literature. In entity resolution, similarity functions are often used to judge whether two data objects refer to the same real-world entity. However, the similarity relations determined by many commonly used similarity functions lack transitivity. This leads to conflicts in which \(A\) and \(B\) are judged to refer to the same entity, \(B\) and \(C\) are judged to refer to the same entity, but \(A\) and \(C\) are not. To address this problem and make group-wise entity resolution results consistent with pairwise entity resolution, this paper models entity resolution as the partition of the vertices of a weighted graph into cohesive subgraphs, a problem that is proven to be co-NP-complete. To solve this problem, an approximation algorithm with a bounded approximation ratio is proposed. To perform entity resolution efficiently on large data sets, a heuristic algorithm is also developed. To implement the heuristic algorithm efficiently, a similarity measure compatible with many commonly used measures is presented. Based on this similarity measure, indices and efficient implementations of the heuristic algorithm are proposed. Extensive experiments verify the efficiency and effectiveness of the proposed methods.
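
The transitivity conflict described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm or its similarity measure: the Jaccard similarity, the threshold of 0.5, the toy records, and the helper names (`is_match`, `violated_triples`, `pairwise_consistent`) are all assumptions chosen for illustration, and "every pair in the group matches" is used only as a simple stand-in for the paper's notion of a cohesive subgraph, which the abstract does not define.

```python
from itertools import combinations

THRESHOLD = 0.5  # hypothetical match threshold, not taken from the paper

# Toy records represented as token sets (illustrative data only).
records = {
    "A": {"x1", "x2", "x3"},
    "B": {"x2", "x3", "x4"},
    "C": {"x3", "x4", "x5"},
}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_match(u: str, v: str) -> bool:
    """Pairwise decision: do records u and v refer to the same entity?"""
    return jaccard(records[u], records[v]) >= THRESHOLD


def violated_triples(names):
    """Return triples (a, b, c) where a~b and b~c hold but a~c does not."""
    found = []
    for a, c in combinations(sorted(names), 2):
        for b in sorted(names):
            if b in (a, c):
                continue
            if is_match(a, b) and is_match(b, c) and not is_match(a, c):
                found.append((a, b, c))
    return found


def pairwise_consistent(group) -> bool:
    """Assumed stand-in for cohesion: every pair in the group must match."""
    return all(is_match(u, v) for u, v in combinations(group, 2))


if __name__ == "__main__":
    print(violated_triples(records))              # triples that break transitivity
    print(pairwise_consistent(["A", "B"]))        # a two-record group
    print(pairwise_consistent(["A", "B", "C"]))   # the group a transitive closure would form
```

In this toy data, A and B share two of four tokens and B and C share two of four, so both pairs reach the threshold, while A and C share only one of five. Grouping by transitive closure would merge all three records even though the pairwise decision for A and C is negative, which is the inconsistency between group-wise and pairwise results that the paper sets out to remove.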

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grants 61472099 and 61133002, and National Sci-Tech Support Plan 2015BAH10F00.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Hongzhi Wang, Jianzhong Li & Hong Gao

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Wang, H., Li, J. & Gao, H. Efficient entity resolution based on subgraph cohesion. Knowl Inf Syst 46, 285–314 (2016). https://doi.org/10.1007/s10115-015-0818-7

Received: 20 November 2012
Revised: 10 November 2014
Accepted: 04 January 2015
Published: 14 January 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10115-015-0818-7
