
Exploiting similarities across multiple dimensions for author name disambiguation

Scientometrics

Abstract

In bibliometric analysis, ambiguity in author names may lead to erroneous aggregation of records. Author name disambiguation techniques address this issue by attributing each record to the correct author. Name disambiguation has been widely studied as a clustering task; however, maintaining consistent accuracy across datasets remains a major challenge. Recent efforts use representation learning to map records to an embedding space in which the clusters can be determined. However, those models that use supervised global embeddings fail to generalize across different datasets, while others lag in accuracy. In this paper, we propose a method that uses two independent relations among the documents, co-authorship and document meta-content, to generate a latent representation of documents that generalizes over various datasets (consisting of different sets of features). Through rigorous validation, we find that the proposed approach outperforms several state-of-the-art methods by a significant margin on standard measures such as pairwise F1, the K metric, and BF1. Moreover, we have validated the performance of our method with a statistical test.
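As an illustration of two of the evaluation measures named in the abstract, the following minimal sketch computes pairwise F1 and the K metric for a clustering of records. It uses the commonly stated textbook definitions (K as the geometric mean of average cluster purity and average author purity). The exact formulation used in the paper is not visible in this preview, so the function names, the label encoding, and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pairwise_f1(true_labels, pred_labels):
    """Pairwise precision, recall and F1 over all pairs of records.

    A pair counts as positive when both records share a label.
    (Hypothetical helper; not taken from the paper's code.)
    """
    pairs = list(combinations(range(len(true_labels)), 2))
    same_true = {p for p in pairs if true_labels[p[0]] == true_labels[p[1]]}
    same_pred = {p for p in pairs if pred_labels[p[0]] == pred_labels[p[1]]}
    hits = len(same_true & same_pred)
    precision = hits / len(same_pred) if same_pred else 0.0
    recall = hits / len(same_true) if same_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def k_metric(true_labels, pred_labels):
    """K = sqrt(ACP * AAP), assuming the usual purity-based definition.

    ACP: average cluster purity (how pure each predicted cluster is).
    AAP: average author purity (how little each true author is split).
    """
    n = len(true_labels)
    joint = Counter(zip(pred_labels, true_labels))  # n_ij counts
    pred_sizes = Counter(pred_labels)               # predicted cluster sizes
    true_sizes = Counter(true_labels)               # true author sizes
    acp = sum(c * c / pred_sizes[p] for (p, _), c in joint.items()) / n
    aap = sum(c * c / true_sizes[t] for (_, t), c in joint.items()) / n
    return sqrt(acp * aap)


# Toy example: five records by two real authors, split into two clusters.
true_authors = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]
print(pairwise_f1(true_authors, predicted))
print(k_metric(true_authors, predicted))
```

Both measures equal 1 for a perfect clustering and fall as records of different authors are merged or records of one author are split. BF1 (B-cubed F1) is not covered by this sketch.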

Notes

  1. https://github.com/yaya213/DBLP-Name-Disambiguation-Dataset.

  2. http://clgiles.ist.psu.edu/data/.

  3. Experimental results for the state-of-the-art methods were obtained by running their released code on the experimental datasets.


Acknowledgement

This work was supported by the Visvesvaraya Ph.D. Scheme, Ministry of Electronics and Information Technology, Government of India under Award MEITY-PHD-2517.

Author information

Corresponding author

Correspondence to KM. Pooja.

About this article

Cite this article

Pooja, K., Mondal, S. & Chandra, J. Exploiting similarities across multiple dimensions for author name disambiguation. Scientometrics 126, 7525–7560 (2021). https://doi.org/10.1007/s11192-021-04101-y

  • DOI: https://doi.org/10.1007/s11192-021-04101-y
