Scalable clustering methods for the name disambiguation problem

On, Byung-Won; Lee, Ingyu; Lee, Dongwon

doi:10.1007/s10115-011-0397-1

Regular Paper
Published: 22 April 2011

Volume 31, pages 129–151, (2012)

Knowledge and Information Systems

Abstract

When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when (parts of) the “names” of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out the entities that are confused because of name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish “Masao Obama” from “Norio Obama”). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is empirically validated via experimentation: our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.

References

Aygun R (2008) S2S: structural-to-syntactic matching similar documents. Knowl Inform Syst 16: 303–329
Banerjee A, Basu S, Merugu S (2007) Multi-way clustering on relation graphs. In: Proceedings of the SIAM data mining, November 2007
Bekkerman R, McCallum A (2005) Disambiguating web appearances of people in a social network. In: Proceedings of international world wide web conference
Cheng D, Kannan R, Vempala S, Wang G (2005) A divide-and-merge methodology for clustering. ACM Trans Database Syst
Cohen W, Ravikumar P, Fienberg S (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IIWEB workshop
Cui X, Potok T, Palathingal P (2005) Document clustering using particle swarm optimization. In: Proceedings of swarm intelligence symposium
Dhillon I, Guan Y, Kulis B (2005) A fast kernel-based multilevel algorithm for graph clustering. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining
Doan A, Lu Y, Lee Y, Han J (2003) Profile-based object matching for information integration. IEEE Intell Syst, September/October, 2–7
Dorneles C, Goncalves R, Mello R (2010) Approximate data instance matching: a survey. Knowl Inform Syst
Frey B, Dueck D (2007) Clustering by passing messages between data points. Science 315
Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Halbert D (2008) Record linkage. Am J Publ Health 36(12): 1412–1416
Han J, Kamber M, Tung A (2001) Spatial clustering methods in data mining: a survey. In: Geographic data mining and knowledge discovery. Taylor and Francis, London
Han H, Giles C, Zha H (2005) Name disambiguation in author citations using a k-way spectral clustering method. In: Proceedings of ACM/IEEE joint conference on digital libraries, June 2005
Hammouda K, Kamel M (2004) Document similarity using a phrase indexing graph model. Knowl Inform Syst 6: 710–727
Heath M (2002) Scientific computing: an introductory survey. Prentice Hall, Englewood Cliffs
Hendrickson B, Leland R (1992) An improved spectral graph partitioning algorithm for mapping parallel computations. Technical report, SAND92-1460, Sandia National Lab, Albuquerque
Hendrickson B, Leland R (1994) The Chaco user’s guide: version 2.0. Sandia
Hernandez M, Stolfo S (1995) The merge/purge problem for large databases. ACM SIGMOD/PODS conference
Hong Y, On B, Lee D (2004) System support for name authority control problem in digital libraries: OpenDBLP approach. In: Proceedings of European conference on digital libraries, Bath, UK, September 2004
Howard S, Tang H, Berry M, Martin D (2009) GTP: general text parser. http://www.cs.utk.edu/~lsi/
Karypis G, Kumar V (1996) A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J Parallel Distributed Comput 48(1): 71–95
Lee D, On B, Kang J, Park S (2005) Effective and scalable solutions for mixed and split citation problems in digital libraries. In: Proceedings of the ACM SIGMOD workshop on information quality in information systems, Baltimore, MD, USA, June 2005
Li Z, Ng W, Sun A (2005) Web data extraction based on structural similarity. Knowl Inform Syst 8: 438–461
Lu W, Milios J, Japkowicz M, Zhang Y (2006) Node similarity in the citation graph. Knowl Inform Syst 11: 105–129
Meila M, Shi J (2001) A random walks view of spectral segmentation. In: Proceedings of the international conference on machine learning
Newman M (2004) Detecting community structure in networks. Eur Phys J B 38: 321–330
On B, Elmacioglu E, Lee D, Kang J, Pei J (2006) Improving grouped-entity resolution using quasi-cliques. In: Proceedings of the IEEE international conference on data mining
On B, Koudas N, Lee D, Srivastava D (2007) Group linkage. In: Proceedings of the IEEE international conference on data engineering
On B, Lee D (2007) Scalable name disambiguation using multi-level graph partition. In: Proceedings of the SIAM international conference on data mining
On B, Lee I (2009) Google based name search: resolving mixed entities on the Web. In: Proceedings of the international conference on digital information management
Pasula H, Marthi B, Milch B, Russell S, Shapitser I (2003) Identity uncertainty and citation matching. Advances in neural information processing 15, MIT press, Cambridge
Pothen A, Simon H, Liou K (1990) Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl 11(3): 430–452
Pothen A, Simon H, Wang L, Bernard S (1992) Toward a fast implementation of spectral nested dissection. In: Proceedings of the SUPERCOM, pp 42–51
SecondString: open-source java-based package of approximate string-matching. http://secondstring.sourceforge.net/
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8): 888–905
Slonim N, Friedman N, Tishby N (2002) Unsupervised document classification using sequential information maximization. In: Proceedings of the SIGIR
Verma D, Meila M (2003) Spectral clustering toolbox. http://www.ms.washington.edu/~spectral/
Wan X (2008) Beyond topical similarity: a structure similarity measure for retrieving highly similar document. Knowl Inform Syst 15: 55–73
Wu X, Kumar V, Quinlan J, Ghosh J, Yang Q (2008) Top 10 algorithms in data mining. Knowl Inform Syst 14: 1–37
Ye S, Wen J, Ma W (2007) A systematic study on parameter correlations in large-scale duplicate document detection. Knowl Inform Syst 14: 217–232
Yippy. http://search.yippy.com/
Yu S, Shi J (2003) Multiclass spectral clustering. In: Proceedings of the international conference on computer vision

Author information

Authors and Affiliations

Advanced Digital Sciences Center, Illinois at Singapore Pte Ltd, 1 Fusionopolis Way, #08-10, Connexis North Tower, 138632, Singapore, Singapore
Byung-Won On
Sorrell College of Business, Troy University, Troy, AL, USA
Ingyu Lee
College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA
Dongwon Lee

Corresponding author

Correspondence to Byung-Won On.

Additional information

This paper was extended from the earlier conference paper that appeared in Ref. [31].

About this article

Cite this article

On, BW., Lee, I. & Lee, D. Scalable clustering methods for the name disambiguation problem. Knowl Inf Syst 31, 129–151 (2012). https://doi.org/10.1007/s10115-011-0397-1

Received: 17 June 2010
Revised: 24 December 2010
Accepted: 23 February 2011
Published: 22 April 2011
Issue Date: April 2012
DOI: https://doi.org/10.1007/s10115-011-0397-1
