Abstract
In many databases, such as scientific bibliography databases, the name attribute is the most commonly chosen identifier for entities. However, names are often ambiguous and not always unique, which causes problems in many fields. Name disambiguation is a non-trivial data management task that aims to correctly distinguish different entities that share the same name. It is particularly difficult for large databases such as digital libraries, where only limited information is available to identify an author's name. In digital libraries, ambiguous author names arise when multiple authors share the same name or when the same person appears under different name variations. Most previous work on this problem employs hierarchical clustering approaches based on information inside the citation records, e.g. co-authors and publication titles. In this paper, we propose a robust hybrid name disambiguation framework that is applicable not only to digital libraries but can also be easily extended to other applications based on different data sources. We propose a web page genre identification component to identify the genre of a web page, e.g. whether the page is a personal homepage. In addition, we propose a re-clustering model based on multidimensional scaling that can further improve the performance of name disambiguation. We evaluated our approach on known corpora, and the favorable experimental results indicate that the proposed framework is feasible.
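To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of hierarchical (single-link) clustering of citation records by co-author overlap. It is an illustration only, not the framework proposed in this paper: the record layout, the Jaccard similarity measure, the `threshold` value, and the function names are all assumptions introduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_link_clusters(records, threshold=0.3):
    """Group records linked by co-author similarity >= threshold.

    Cutting a single-link dendrogram at a fixed similarity is equivalent
    to taking connected components of the thresholded similarity graph,
    so a union-find structure is enough for this sketch.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        sim = jaccard(records[i]["coauthors"], records[j]["coauthors"])
        if sim >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec["id"])
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical citation records that all carry the same ambiguous author name.
records = [
    {"id": "c1", "coauthors": {"A. Kumar", "B. Lee"}},
    {"id": "c2", "coauthors": {"B. Lee", "C. Diaz"}},
    {"id": "c3", "coauthors": {"X. Chen", "Y. Park"}},
    {"id": "c4", "coauthors": {"Y. Park"}},
]
print(single_link_clusters(records))
```

In this toy input, records c1 and c2 share one co-author and so do c3 and c4, so the sketch would attribute the four citations to two distinct authors. A realistic system would combine further evidence such as publication titles.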













Cite this article
Zhu, J., Yang, Y., Xie, Q. et al. Robust hybrid name disambiguation framework for large databases. Scientometrics 98, 2255–2274 (2014). https://doi.org/10.1007/s11192-013-1151-0
DOI: https://doi.org/10.1007/s11192-013-1151-0