Scalable clustering methods for the name disambiguation problem

  • Regular Paper
  • Knowledge and Information Systems

Abstract

When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when (parts of) the “names” of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out the entities that are confused because of name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish “Masao Obama” from “Norio Obama”). In this paper, we study the scalability of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is validated empirically: our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared with competing solutions.
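The abstract names multi-level graph partitioning but does not spell out the procedure, so the following is only a generic, minimal sketch of the coarsen, partition, and project-back idea, not the authors' implementation. It assumes that citations sharing an ambiguous name are graph nodes and that edge weights are some similarity score; the node names (`a1` ... `b4`), the weights, and the functions `coarsen`, `bisect`, and `disambiguate` are all hypothetical. A real multi-level scheme would repeat the coarsening step several times and refine the split while uncoarsening, which is omitted here.

```python
from itertools import combinations

def coarsen(nodes, edges):
    """One coarsening pass: greedily merge each unmatched node with its
    heaviest unmatched neighbour (heavy-edge matching)."""
    adj = {u: {} for u in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    owner = {}    # original node -> index of its supernode
    groups = []   # supernode index -> list of original nodes
    for u in nodes:
        if u in owner:
            continue
        free = [(w, v) for v, w in adj[u].items() if v not in owner]
        members = [u, max(free)[1]] if free else [u]
        for m in members:
            owner[m] = len(groups)
        groups.append(members)
    # Sum the weights of edges that cross supernodes.
    coarse = {}
    for (u, v), w in edges.items():
        a, b = sorted((owner[u], owner[v]))
        if a != b:
            coarse[(a, b)] = coarse.get((a, b), 0) + w
    return groups, coarse

def bisect(groups, coarse):
    """Brute-force the two-way split of the small coarse graph that minimises
    ratio cut = cut weight / (size of side A * size of side B)."""
    total = sum(len(g) for g in groups)
    best, best_score = None, float("inf")
    for r in range(1, len(groups)):
        for side in combinations(range(len(groups)), r):
            side = set(side)
            cut = sum(w for (a, b), w in coarse.items()
                      if (a in side) != (b in side))
            size_a = sum(len(groups[i]) for i in side)
            score = cut / (size_a * (total - size_a))
            if score < best_score:
                best, best_score = side, score
    return best

def disambiguate(nodes, edges):
    """Coarsen once, split the coarse graph, and project the split back
    onto the original nodes."""
    groups, coarse = coarsen(nodes, edges)
    side = bisect(groups, coarse)
    first = sorted(n for i in side for n in groups[i])
    second = sorted(n for i in range(len(groups)) if i not in side
                    for n in groups[i])
    return first, second

# Hypothetical toy data: eight citations under one ambiguous name,
# with similarity weights chosen purely for illustration.
nodes = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
edges = {
    ("a1", "a2"): 3, ("a3", "a4"): 3, ("a2", "a3"): 2, ("a1", "a3"): 1,
    ("b1", "b2"): 3, ("b3", "b4"): 3, ("b2", "b3"): 2, ("b1", "b3"): 1,
    ("a4", "b1"): 1,  # a weak link between the two underlying authors
}
print(disambiguate(nodes, edges))
```

The point of coarsening is that the expensive partitioning step runs on the four supernodes rather than on the eight original citations. The brute-force `bisect` is only workable because the coarse graph is tiny; it stands in for whatever partitioner a real system would apply at the coarsest level.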

References

  1. Aygun R (2008) S2S: structural-to-syntactic matching similar documents. Knowl Inform Syst 16: 303–329

  2. Banerjee A, Basu S, Merugu S (2007) Multi-way clustering on relation graphs. In: Proceedings of the SIAM data mining, November 2007

  3. Bekkerman R, McCallum A (2005) Disambiguating web appearances of people in a social network. In: Proceedings of international world wide web conference

  4. Cheng D, Kannan R, Vempala S, Wang G (2005) A divide-and-merge methodology for clustering. ACM Trans Database Syst

  5. Cohen W, Ravikumar P, Fienberg S (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IIWeb workshop

  6. Cui X, Potok T, Palathingal P (2005) Document clustering using particle swarm optimization. In: Proceedings of swarm intelligence symposium

  7. Dhillon I, Guan Y, Kulis B (2005) A fast kernel-based multilevel algorithm for graph clustering. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining

  8. Doan A, Lu Y, Lee Y, Han J (2003) Profile-based object matching for information integration. IEEE Intell Syst, September/October, 2–7

  9. Dorneles C, Goncalves R, Mello R (2010) Approximate data instance matching: a survey. Knowl Inform Syst

  10. Frey B, Dueck D (2007) Clustering by passing messages between data points. Science 315

  11. Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

  12. Halbert D (2008) Record linkage. Am J Publ Health 36(12): 1412–1416

  13. Han J, Kamber M, Tung A (2001) Spatial clustering methods in data mining: a survey. In: Geographic data mining and knowledge discovery. Taylor and Francis, London

  14. Han H, Giles C, Zha H (2005) Name disambiguation in author citations using a k-way spectral clustering method. In: Proceedings of ACM/IEEE joint conference on digital libraries, June 2005

  15. Hammouda K, Kamel M (2004) Document similarity using a phrase indexing graph model. Knowl Inform Syst 6: 710–727

  16. Heath M (2002) Scientific computing: an introductory survey. Prentice Hall, Englewood Cliffs

  17. Hendrickson B, Leland R (1992) An improved spectral graph partitioning algorithm for mapping parallel computations. Technical report, SAND92-1460, Sandia National Lab, Albuquerque

  18. Hendrickson B, Leland R (1994) The Chaco user’s guide: version 2.0. Sandia

  19. Hernandez M, Stolfo S (1995) The merge/purge problem for large databases. ACM SIGMOD/PODS conference

  20. Hong Y, On B, Lee D (2004) System support for name authority control problem in digital libraries: OpenDBLP approach. In: Proceedings of European conference on digital libraries, Bath, UK, September 2004

  21. Howard S, Tang H, Berry M, Martin D (2009) GTP: general text parser. http://www.cs.utk.edu/~lsi/

  22. Karypis G, Kumar V (1996) A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J Parallel Distributed Comput 48(1): 71–95

  23. Lee D, On B, Kang J, Park S (2005) Effective and scalable solutions for mixed and split citation problems in digital libraries. In: Proceedings of the ACM SIGMOD workshop on information quality in information systems, Baltimore, MD, USA, June 2005

  24. Li Z, Ng W, Sun A (2005) Web data extraction based on structural similarity. Knowl Inform Syst 8: 438–461

  25. Lu W, Milios J, Japkowicz M, Zhang Y (2006) Node similarity in the citation graph. Knowl Inform Syst 11: 105–129

  26. Meila M, Shi J (2001) A random walks view of spectral segmentation. In: Proceedings of the international conference on machine learning

  27. Newman M (2004) Detecting community structure in networks. Eur Phys J B 38: 321–330

  28. On B, Elmacioglu E, Lee D, Kang J, Pei J (2006) Improving grouped-entity resolution using quasi-cliques. In: Proceedings of the IEEE international conference on data mining

  29. On B, Koudas N, Lee D, Srivastava D (2007) Group linkage. In: Proceedings of the IEEE international conference on data engineering

  30. On B, Lee D (2007) Scalable name disambiguation using multi-level graph partition. In: Proceedings of the SIAM international conference on data mining

  31. On B, Lee I (2009) Google based name search: resolving mixed entities on the Web. In: Proceedings of the international conference on digital information management

  32. Pasula H, Marthi B, Milch B, Russell S, Shpitser I (2003) Identity uncertainty and citation matching. Advances in neural information processing 15, MIT Press, Cambridge

  33. Pothen A, Simon H, Liou K (1990) Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl 11(3): 430–452

  34. Pothen A, Simon H, Wang L, Bernard S (1992) Toward a fast implementation of spectral nested dissection. In: Proceedings of the SUPERCOM, pp 42–51

  35. SecondString: open-source java-based package of approximate string-matching. http://secondstring.sourceforge.net/

  36. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8): 888–905

  37. Slonim N, Friedman N, Tishby N (2002) Unsupervised document classification using sequential information maximization. In: Proceedings of the SIGIR

  38. Verma D, Meila M (2003) Spectral clustering toolbox. http://www.ms.washington.edu/~spectral/

  39. Wan X (2008) Beyond topical similarity: a structure similarity measure for retrieving highly similar document. Knowl Inform Syst 15: 55–73

  40. Wu X, Kumar V, Quinlan J, Ghosh J, Yang Q (2008) Top 10 algorithms in data mining. Knowl Inform Syst 14: 1–37

  41. Ye S, Wen J, Ma W (2007) A systematic study on parameter correlations in large-scale duplicate document detection. Knowl Inform Syst 14: 217–232

  42. Yippy. http://search.yippy.com/

  43. Yu S, Shi J (2003) Multiclass spectral clustering. In: Proceedings of the international conference on computer vision

Author information

Corresponding author

Correspondence to Byung-Won On.

Additional information

This paper was extended from the earlier conference paper that appeared in Ref. [31].

About this article

Cite this article

On, BW., Lee, I. & Lee, D. Scalable clustering methods for the name disambiguation problem. Knowl Inf Syst 31, 129–151 (2012). https://doi.org/10.1007/s10115-011-0397-1

  • DOI: https://doi.org/10.1007/s10115-011-0397-1
