
Journal of Informetrics

Volume 13, Issue 1, February 2019, Pages 299-313

Regular article
Globalised vs averaged: Bias and ranking performance on the author level

https://doi.org/10.1016/j.joi.2019.01.006

Highlights

  • We compare the averaged and globalised aggregation approaches on the author level.

  • We analyse the differences based on various normalised paper-level metrics.

  • The differences in field bias are marginal.

  • The globalised variant of percentile scores identifies influential researchers better.

  • The differences between the variants for citation scores are generally insignificant.

Abstract

We analyse the difference between the averaged (average of ratios) and globalised (ratio of averages) author-level aggregation approaches based on various paper-level metrics. We evaluate the aggregation variants in terms of (1) their field bias on the author-level and (2) their ranking performance based on test data that comprises researchers that have received fellowship status or won prestigious awards for their long-lasting and high-impact research contributions to their fields. We consider various direct and indirect paper-level metrics with different normalisation approaches (mean-based, percentile-based, co-citation-based) and focus on the bias and performance differences between the two aggregation variants of each metric. We execute all experiments on two publication databases which use different field categorisation schemes. The first uses author-chosen concept categories and covers the computer science literature. The second covers all disciplines and categorises papers by keywords based on their contents. In terms of bias, we find relatively little difference between the averaged and globalised variants. For mean-normalised citation counts we find no significant difference between the two approaches. However, the percentile-based metric shows less bias with the globalised approach, except for citation windows smaller than four years. On the multi-disciplinary database, PageRank has the overall least bias but shows no significant difference between the two aggregation variants. The averaged variants of most metrics have less bias for small citation windows. For larger citation windows the differences are smaller and are mostly insignificant.

In terms of ranking the well-established researchers who have received accolades for their high-impact contributions, we find that the globalised variant of the percentile-based metric performs better. Again we find no significant differences between the globalised and averaged variants based on citation counts and PageRank scores.

Introduction

Citation metrics constitute a key tool in scientometrics and play an increasingly important role in the evaluation of researchers (Bornmann, 2017). To enable fair evaluations, it is a de facto requirement that metrics are field and time normalised (Waltman, 2016). On the paper level, a paper's actual score is usually compared to the expected score computed from a reference set comprising papers from the same field and published in the same year. Normalised paper impact scores may then be aggregated to define author-level impact metrics. Two aggregation approaches exist that use the actual and expected scores of papers. The first computes the average of each paper's ratio of actual and expected scores, which is often referred to as the ‘average of ratios’ or averaged approach (Waltman, 2016). The second divides the sum of an author's actual paper scores by the sum of the corresponding expected paper scores. The latter is also referred to as the ‘ratio of averages’ or globalised approach (Egghe & Rousseau, 1996). Opinions differ as to which approach is better suited or more appropriate for the evaluation of academic entities (Egghe & Rousseau, 1996; Lundberg, 2007; Moed, 2010; Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011a). In this paper we take a quantitative look at the differences between these two aggregation approaches.
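
Written out under this notation (c_i for paper i's actual score, e_i for its expected score, N papers per author; the symbols are ours and only restate the definitions above), the two variants are:

    \[
      S_{\text{avg}} \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{c_i}{e_i}
      \qquad\text{vs.}\qquad
      S_{\text{glob}} \;=\; \frac{\sum_{i=1}^{N} c_i}{\sum_{i=1}^{N} e_i}
    \]

The two values coincide whenever all expected scores are equal, and diverge when an author's papers fall into reference sets with different citation densities.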

Since no gold standard for normalised metrics exists (Bornmann & Marx, 2018), we use a number of different paper-level impact metrics that use different normalisation strategies to overcome the bias introduced by citation potentials that vary across research fields and over time. For instance, we use mean-normalised citation scores where a paper's citation count is compared to the mean (expected) citation count of papers published in its field and in the same year (Lundberg, 2007; Radicchi, Fortunato, & Castellano, 2008). We also use a percentile metric where a paper is rated in terms of its percentile in the score distribution of papers in the same field and with the same publication year (Bornmann, Leydesdorff, & Mutz, 2013; Leydesdorff, Bornmann, Mutz, & Opthof, 2011). In addition, we consider indirect metrics (Giuffrida, Abramo, & D’Angelo, 2018; Pinski & Narin, 1976) and a metric where a paper's co-cited papers (papers that are cited together with it) are used as the reference set to compute expected paper scores (Hutchins, Yuan, Anderson, & Santangelo, 2016).
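
As an illustration of these normalisation strategies (a minimal sketch, not the authors' code), the following computes mean-normalised and percentile scores from a list of papers, assuming each paper carries a field label, a publication year and a citation count, with field and year together defining the reference set:

    from collections import defaultdict

    def mean_normalised_scores(papers):
        # Reference set = papers sharing the same field and publication year.
        groups = defaultdict(list)
        for p in papers:
            groups[(p["field"], p["year"])].append(p["citations"])
        # Expected score = mean citation count of the reference set.
        expected = {key: sum(vals) / len(vals) for key, vals in groups.items()}
        scores = {}
        for p in papers:
            e = expected[(p["field"], p["year"])]
            scores[p["id"]] = p["citations"] / e if e > 0 else 0.0
        return scores

    def percentile_scores(papers):
        # One of several possible percentile definitions: the share of the
        # reference set that the paper outperforms, on a 0-100 scale.
        groups = defaultdict(list)
        for p in papers:
            groups[(p["field"], p["year"])].append(p["citations"])
        scores = {}
        for p in papers:
            ref = groups[(p["field"], p["year"])]
            scores[p["id"]] = 100.0 * sum(c < p["citations"] for c in ref) / len(ref)
        return scores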

An important task in scientometrics is to validate that metrics fulfil their intended purpose. According to Bornmann and Marx (2018), situations should be created or found in empirical research in which a metric can fail to achieve its purpose. A metric should only be regarded as provisionally valid if these situations could not be found or realised. For example, metrics that are intended to rate the quality of papers should be assessed by correlating them with peer assessments (Bornmann & Marx, 2015). This also applies to author-level evaluations. However, collecting direct peer-assessed test data is time consuming and expensive. We therefore use a proxy for this assessment which comprises test data based on awards and other recognitions that researchers have received for their outstanding contributions in their fields (Dunaiski, Geldenhuys, & Visser, 2018a; Dunaiski, Visser, & Geldenhuys, 2016; Fiala, Šubelj, Žitnik, & Bajec, 2015; Fiala, 2012; Fiala, Rousselot, & Ježek, 2008; Fiala & Tutoky, 2017; Gao, Wang, Li, Zhang, & Zeng, 2016; Nykl, Campr, & Ježek, 2015; Nykl, Ježek, Fiala, & Dostal, 2014). Specifically, we use selected researchers that have won prizes for their highly influential and long-lasting contributions and researchers that have been awarded the ACM fellowship for similar achievements.

We follow the appeal by Bornmann and Marx (2018) for continued scrutiny of current proposals and analyse the difference between the averaged and globalised variants of various paper-level metrics along two dimensions: (1) their fairness to rank authors across fields, and (2) their performance in ranking the well-established researchers comprising our test data set. We compare the overall bias and performance of the metrics but focus on the differences between the averaged and globalised variants for each paper-level metric.

We conduct all experiments on two publication databases. The first database is the ACM Digital Library (ACM, Inc, 2014) which provides a Computing Classification System (CCS) that consists of a library-like, hierarchical structure of concepts. Authors may assign their papers to one or more concepts in this classification hierarchy. We use the CCS to categorise papers and authors into subfields of the computer science discipline. The second database is the Microsoft Academic Graph (MAG) database (Microsoft, 2017). It is multi-disciplinary and papers are assigned to fields in a hierarchical structure based on keywords extracted from their texts. We use the top-level fields as paper categories which roughly capture the scientific disciplines such as ‘Mathematics’ or ‘Medicine’. Again, we categorise authors into disciplines based on their published work.

With this paper we make the following contributions:

  • We analyse the averaged and globalised aggregation approaches on the author level using two different field classification schemes. The first is a categorisation in which authors choose their papers’ categories (ACM database). The second is based on semantic information contained in titles and abstracts (MAG database).

  • We consider a range of paper-level metrics and show that for some metrics the choice between using the averaged or the globalised approach is important and impacts the author-level metric's field bias as well as its performance in identifying well-established researchers.

  • We analyse the bias and performance of the variants over a range of citation window sizes (1–25 years). We find that the choice between the aggregation variants depends only weakly on citation window size. However, the differences between metrics change substantially with different citation windows.

In this paper we first provide the reader with background information about normalisation factors and focus on the arguments for or against the averaged and globalised approaches (Section 2). In Section 3, we describe the methodology of evaluating the metrics along the bias and performance dimensions. We present the results in Section 4, followed by a discussion of the results in Section 5.

Section snippets

Paper-level normalisation

One of the key principles of bibliometrics is that entities from different fields should not be compared directly based on total citation counts. This stems from the observation that citation densities (mean citation counts) vary between fields due to their different sizes and publication cultures (Lundberg, 2007; Radicchi et al., 2008). Citation densities may even vary between narrow subfields within the same discipline (van Leeuwen & Calero Medina, 2012). In addition, citation counts of

Publication databases

We use two publication databases for the experiments described in this paper. The first is a 2015 version of the ACM Digital Library (ACM, Inc, 2014). It contains papers up to March 2015 that are published in periodicals and proceedings from the field of computer science. The ACM uses a categorisation scheme which is called the Computing Classification System (CCS) where each paper is associated with one or more concepts that are organised in a poly-hierarchical structure (ACM, Inc, 2017b).

Results

We use the different paper-level metrics discussed in Section 2.3 and aggregate them to the author level by using the three aggregation approaches discussed in Section 2.1. Therefore, each metric has three variants: (1) the size-dependent (total) variant which is the sum of an author's paper scores, (2) the average of ratios (averaged) variant, and (3) the ratio of averages (globalised) variant.
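
A minimal sketch of the three variants, assuming each of an author's papers has an actual score and a positive expected score from its reference set (illustrative only, not the authors' implementation):

    def aggregate_author(actual, expected):
        # actual[i] and expected[i] are the actual and expected scores of the
        # author's i-th paper; expected scores are assumed to be positive.
        ratios = [a / e for a, e in zip(actual, expected)]
        total = sum(ratios)                       # (1) size-dependent (total)
        averaged = total / len(ratios)            # (2) average of ratios (averaged)
        globalised = sum(actual) / sum(expected)  # (3) ratio of averages (globalised)
        return total, averaged, globalised

    # Example: actual scores 10, 2, 0 with expected scores 5, 4, 2 give
    # total = 2.5, averaged = 0.83 and globalised = 12/11 = 1.09.
    print(aggregate_author([10, 2, 0], [5.0, 4.0, 2.0]))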

We evaluate the author metrics on their field bias (Section 4.1) and ranking performance (Section

Discussion

The Abramo method. As mentioned before, comparing the results of the Abramo method to citation counts may yield some insight into how the indirect impact can influence the performance and bias of author scores. The Abramo method defines a paper's score as its citation count C plus the score obtained from indirect citations which ranges between 0 and C. Therefore, the actual score of a paper ranges between C and 2C, where 2C is achieved if all citing papers are themselves the most cited papers
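
Written out (notation ours), the bounds described above are:

    \[
      S_p \;=\; C_p + I_p, \qquad 0 \le I_p \le C_p
      \quad\Longrightarrow\quad
      C_p \le S_p \le 2\,C_p
    \]

where C_p is the paper's citation count and I_p the indirect component accumulated from its citing papers.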

Threats to validity and future work

In Section 4.1 we computed the bias of metrics for differently sized citation windows t. For this computation we created the null models based on all authors that have received a score at time t1 + t, where t1 is the year in which an author first published a paper. The score of an author is therefore based on a citation graph that contains all papers and citations up to the year t1 + t. Since t1 varies among authors, the author scores at time t are based on citation graphs with different time

Conclusion

In this paper, we compared the averaged (average of ratios) and the globalised (ratio of averages) aggregation approaches on the author level. We used different time- and field-normalised paper-level metrics which we, once aggregated to the author level, evaluated in terms of field bias and performance. We evaluated performance based on how well the metrics rank the authors of a test data set comprising well-established researchers who have received accolades for their impactful and

Author contributions

Marcel Dunaiski: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.

Jaco Geldenhuys: Other contribution.

References (60)

  • D. Fiala et al. (2015). Do PageRank-based author rankings outperform simple citation counts? Journal of Informetrics.

  • D. Fiala et al. (2017). PageRank-based prediction of award-winning researchers and the impact of citations. Journal of Informetrics.

  • V. Larivière et al. (2011). Averages of ratios vs. ratios of averages: An empirical analysis of four levels of aggregation. Journal of Informetrics.

  • X. Liu et al. (2005). Co-authorship networks in the digital library research community. Information Processing & Management.

  • J. Lundberg (2007). Lifting the crown-citation z-score. Journal of Informetrics.

  • M.S. Mariani et al. (2016). Identification of milestone papers through time-balanced network centrality. Journal of Informetrics.

  • H.F. Moed (2010). CWTS crown indicator measures citation impact of a research group's publication oeuvre. Journal of Informetrics.

  • M. Nykl et al. (2015). Author ranking based on personalized PageRank. Journal of Informetrics.

  • M. Nykl et al. (2014). PageRank variants in the evaluation of citation networks. Journal of Informetrics.

  • T. Opthof et al. (2010). Caveats for the journal and field normalizations in the CWTS (“Leiden”) evaluations of research performance. Journal of Informetrics.

  • G. Pinski et al. (1976). Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing & Management.

  • F. Radicchi et al. (2012). Testing the fairness of citation indicators for comparison across scientific domains: The case of fractional citation counts. Journal of Informetrics.

  • J. Ruiz-Castillo et al. (2015). Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics.

  • F.N. Silva et al. (2013). Quantifying the interdisciplinarity of scientific journals and fields. Journal of Informetrics.

  • D. Sirtes (2012). Finding the Easter eggs hidden by oneself: Why Radicchi and Castellano's (2012) fairness test for citation indicators is not fair. Journal of Informetrics.

  • L. Smolinsky (2016). Expected number of citations and the crown indicator. Journal of Informetrics.

  • G. Vaccario et al. (2017). Quantifying and suppressing ranking bias in a large citation network. Journal of Informetrics.

  • P. Vinkler (2012). The case of scientometricians with the “absolute relative” impact indicator. Journal of Informetrics.

  • L. Waltman (2016). A review of the literature on citation impact indicators. Journal of Informetrics.

  • L. Waltman et al. (2011). Towards a new crown indicator: Some theoretical considerations. Journal of Informetrics.