The influence of missing publications on the Hirsch index

https://doi.org/10.1016/j.joi.2006.05.001

Abstract

We show that the influence of missing highly cited articles on the Hirsch index is usually much smaller than the number of missing articles. This statement is shown by a combinatorial argument. We further show, by using a continuous power law model, that the influence of missing articles is largest when the total number of publications is small, and non-existent when the number of publications is very large. The same conclusion can be drawn for missing citations. Hence, the h-index is resilient to missing articles and to missing citations.

Introduction

Recently the Hirsch index, in short: h-index, has attracted a lot of attention in the scientific community (Bar-Ilan, 2006; Egghe, in press; Glänzel, 2006; Liang, 2006). This index, introduced by Hirsch (2005), is calculated as follows. Consider the list of publications co-authored by scientist S, ranked in decreasing order of the number of citations each of these has received over a given period. Then scientist S's h-index is h if h is the largest natural number such that each of the first h publications received at least h citations. Clearly, this definition can also be applied to other source-item pairs, besides a scientist's publications and citations (Braun et al., 2005; Egghe & Rousseau, 2006; Rousseau, 2006).
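The definition above can be made concrete with a minimal sketch. The function name `h_index` and the sample citation counts are illustrative assumptions, not taken from the article.

```python
def h_index(citations):
    """Return the largest h such that the h most-cited items
    each have at least h citations (0 for an empty list)."""
    ranked = sorted(citations, reverse=True)  # rank 1 = most cited
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the first `rank` items all have >= rank citations
        else:
            break      # counts only decrease from here, so stop
    return h


# Illustrative (made-up) citation counts for five publications.
print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with at least 4 citations
```

Sorting first keeps the sketch close to the wording of the definition; the loop stops at the first rank whose citation count falls below the rank.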

In most applications citations have been taken into account only if the corresponding articles have been published in a journal covered by the Web of Knowledge (Thomson Scientific). Yet, it is also possible to collect citations from the Web via Google Scholar (Bar-Ilan, 2006), or from a local database such as the CSCD in China (Liu Zeyuan, personal communication). Expressed in a conglomerate framework, this means that the choice of pool is essential (Rousseau, 2005). It is indeed quite possible that a scientist's most cited works are published in conference proceedings, free web-journals or, generally, in sources not covered by the Web of Knowledge. What then is the influence of these highly cited articles on a scientist's h-index?

Section snippets

A simple discrete model

It is assumed that the missing articles, m in number, include s highly cited ones, that is, articles above the level of the h-index. Secondly, it is assumed that in the initial situation the article at rank h receives exactly h citations. Finally, it is assumed that in the neighbourhood of the original h-index the difference between the numbers of citations received by consecutive articles in the ranking is a fixed number. None of these assumptions is crucial for the point we want

A first example: Citations follow a Zipf distribution

If citations follow a Zipf distribution, the number of citations of the source at rank r is equal to Z/r. In this case the h-index is found by solving the equation h = Z/h, hence h = √Z. Taking h equal to a natural number means that h is equal to the largest natural number smaller than or equal to √Z. This number is known as the floor of √Z, denoted ⌊√Z⌋. In Table 4 the h-index is calculated for some values of Z, as well as the number of citations (rounded) of the sources at
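As a rough illustration of the Zipf case, the sketch below generates citation counts Z/r, checks that the h-index equals the floor of the square root of Z, and then drops the s most-cited sources to see how far the h-index falls. The value Z = 100, the choice s = 5, the number of sources and all function names are assumptions made for this example. Citation counts are left unrounded here, which is a simplification and does not reproduce Table 4.

```python
import math


def h_index(citations):
    """Largest h such that the h most-cited items have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def zipf_citations(Z, n_sources):
    """Citations of the source at rank r are Z / r (unrounded)."""
    return [Z / r for r in range(1, n_sources + 1)]


Z = 100                                # assumed Zipf constant
full = zipf_citations(Z, n_sources=50)
s = 5                                  # assumed number of missing top sources
observed = full[s:]                    # the s most-cited sources are not in the pool

print(h_index(full), math.isqrt(Z))    # 10 10
print(h_index(observed))               # 7
```

Under these assumptions, losing the five most-cited sources lowers the h-index by three, from 10 to 7, which is fewer than the number of missing sources.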

A second example: Price awardees

Leo Egghe has recently introduced an alternative to the h-index (Egghe, 2006a, 2006b, 2006c). This is not the subject of this note, but we will use his tables of h-indices of Price medallists to study the influence of missing publications on a scientist's h-index. We will assume that for each of them five highly cited articles are missing and we will recalculate their h-index, based on the data in Egghe (2006c). Results are shown in Table 5.
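The recalculation described here can be sketched as follows. The citation list is made up for illustration and is not the Price medallists' data from Egghe (2006c). The helper `h_with_missing_top_articles` is a hypothetical name, and it assumes that each missing article is cited at least as often as any article already in the list.

```python
def h_index(citations):
    """Largest h such that the h most-cited items have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_with_missing_top_articles(citations, s):
    """Recompute the h-index after adding s hypothetical highly cited articles.

    Assumption: every added article has enough citations to sit at the
    top of the ranking and to count towards any attainable h-index.
    """
    high = max(max(citations, default=0), len(citations) + s)
    return h_index(list(citations) + [high] * s)


# Made-up citation counts for one scientist (not data from the article).
observed = [30, 22, 15, 12, 9, 8, 8, 6, 5, 3, 2, 1]

print(h_index(observed))                         # 7
print(h_with_missing_top_articles(observed, 5))  # 9
```

In this made-up case, adding five highly cited articles raises the h-index by only two, because the articles just below the original h-th rank do not have enough citations to support a larger jump.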

This example shows that for this list

An analytical model based on a power law

In this section we show that a power law, i.e. a Lotka model, as used in an earlier publication (Egghe & Rousseau, 2006) leads to the same conclusion as the combinatorial argument presented above.

In this earlier publication we proved that if citations (or in general: item frequencies) can be described by a negative power law with exponent α > 1, and if the system has T sources, then the h-index (actually its real-valued version, because in this approach the h-index is not a natural number

Conclusion

Contrary to what one might intuitively expect, a relatively small number of missing highly cited publications has only a small influence on the value of the h-index. This is usually the case, as shown by the examples of citations following a Zipf distribution and of the h-indices of Price medallists. An analytical model reinforces our argument for missing publications, as well as for missing citations. We conclude that the h-index is resilient to missing articles and to missing citations.

Acknowledgements

Research for this note was performed while the author was a guest of WISE-Lab, Dalian University of Technology and of the National Library of Sciences of CAS (Beijing). He thanks Profs. Liu Zeyuan and Jin Bihui for their hospitality. The author further thanks Prof. Leo Egghe (Hasselt University) for a number of helpful suggestions, improving the obtained results. Research for this article was supported by NSFC Grant Nr. 70373055.

References (13)

  • R. Rousseau (2005). Conglomerates as a general framework for informetric research. Information Processing and Management.
  • J. Bar-Ilan (2006). H-index for Price medallists revisited. ISSI Newsletter.
  • T. Braun et al. (2005). A Hirsch-type index for journals. The Scientist.
  • L. Egghe (2005). Power Laws in the Information Production Process: Lotkaian Informetrics.
  • L. Egghe (2006). How to improve the h-index. The Scientist.
  • L. Egghe (2006). An improvement of the H-index: the G-index. ISSI Newsletter.
There are more references available in the full text version of this article.

Cited by (44)

  • Measuring the robustness of the journal h-index with respect to publication and citation values: A Bayesian sensitivity analysis

    2016, Journal of Informetrics
    Citation Excerpt :

    Courtault and Hayek (2008) have theoretically shown that a significant number of papers significantly cited must be published to increase the h-index. In the same lines, Rousseau (2007) found, by utilizing theoretical models, that a relative small number of highly cited publications have a small influence on the h-index. According to Minasny, Hartemink, McBratney, and Jang (2013), the h-index is less sensitive to the increase in the number of citations and it does not penalize a journal for publishing a larger number of papers.

  • Strange attractors in the Web of Science database

    2011, Journal of Informetrics
    Citation Excerpt :

    These misdeeds merge with other errors of commission that render stray references in the WoS database in a way that affects all indicators of research performance or impact that are based on citation counts, including h indices. It might be argued that this problem should not affect h indices, given that Rousseau (2007) developed a theoretical argument whereby the h index is robust to missing citations; yet, an empirical study (García-Pérez, in press) has shown that the h index is not that robust in real conditions. The exact magnitude and consequences of phantom citations and strange attractors in WoS is hard to ascertain, but the misdemeanor encourages the use of other platforms for the accrual of complete citation records (see also García-Pérez, in press).

  • h-Index: A review focused in its variants, computation and standardization for different scientific fields

    2009, Journal of Informetrics
    Citation Excerpt :

    Ye and Rousseau (2008) complemented the previous work to find out if power law models for a specific type of h-index time series fit real data sets. Rousseau (2007) has also used a continuous power law model in order to show that the influence of missing articles is largest when the total number of publications is small and non-existing when the number of publications is very large (the same conclusion is drawn for missing citations). Even in Hirsch’s initial proposal, the fact that the h-index cannot directly be used to compare research workers of different areas, mainly due to lack of normalization for reference practices and traditions in the different fields of science (Glanzel & Moed, 2002; Pinski & Narin, 1976) was pointed out.

  • A review on h-index and its alternative indices

    2023, Journal of Information Science
  • A discrete truncated Zipf distribution

    2023, Statistica Neerlandica