Abstract
In bibliometric analysis, ambiguity in author names may lead to erroneous aggregation of records. Author name disambiguation techniques address this issue by attributing each record to the correct author. Name disambiguation has been widely studied as a clustering task, but maintaining consistent accuracy across datasets remains a major challenge. Recent efforts use representation learning to map records to an embedding space in which the clusters can be determined. However, some of these models rely on supervised global embeddings and fail to generalize across datasets, while others lag in accuracy. In this paper, we propose a method that uses two independent relations among the documents, namely co-authorship and document meta-content, to generate a latent representation of documents that generalizes over various datasets (consisting of different sets of features). Through rigorous validation, we find that the proposed approach outperforms several state-of-the-art methods by a significant margin in terms of standard measures such as pairwise F1, K metric, and BF1 scores. We have also validated the performance of our method with a statistical test.
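As an illustration of one of the evaluation measures named above, the following is a minimal sketch of pairwise F1 for a clustering of records. It assumes the common definition in which precision and recall are computed over pairs of records placed in the same cluster; the function names `same_cluster_pairs` and `pairwise_f1` and the toy labels are hypothetical and are not taken from the paper's code.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, whose records share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 between a ground-truth and a predicted clustering.

    Assumed definition: precision is the fraction of predicted same-cluster
    pairs that are truly co-authored by one person; recall is the fraction of
    true same-author pairs that the prediction recovers.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(pred_labels)
    correct = len(true_pairs & pred_pairs)

    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: five records under one ambiguous name, two real authors.
true_authors = ["a", "a", "a", "b", "b"]
predicted_clusters = [0, 0, 1, 1, 1]
score = pairwise_f1(true_authors, predicted_clusters)
```

In the toy example, two of the four predicted pairs are correct and two of the four true pairs are recovered, so precision and recall are both 0.5. Cluster identifiers need not match between the two labelings, since only pair membership is compared.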
Notes
The experimental results for the state-of-the-art methods were obtained by running their released code on the experimental dataset.
Acknowledgement
This work was supported by the Visvesvaraya Ph.D. Scheme, Ministry of Electronics and Information Technology, Government of India under Award MEITY-PHD-2517.
Cite this article
Pooja, K., Mondal, S. & Chandra, J. Exploiting similarities across multiple dimensions for author name disambiguation. Scientometrics 126, 7525–7560 (2021). https://doi.org/10.1007/s11192-021-04101-y
DOI: https://doi.org/10.1007/s11192-021-04101-y