Globalised vs averaged: Bias and ranking performance on the author level
Introduction
Citation metrics constitute a key tool in scientometrics and play an increasingly important role in the evaluation of researchers (Bornmann, 2017). To enable fair evaluations, it is a de facto requirement that metrics are field and time normalised (Waltman, 2016). On the paper level, a paper's actual score is usually compared to the expected score computed from a reference set comprising papers from the same field and published in the same year. Normalised paper impact scores may then be aggregated to define author-level impact metrics. Two aggregation approaches exist that use the actual and expected scores of papers. The first computes the average of each paper's ratio of actual and expected scores, which is often referred to as the ‘average of ratios’ or averaged approach (Waltman, 2016). The second divides the sum of an author's actual paper scores by the sum of the corresponding expected paper scores. The latter is also referred to as the ‘ratio of averages’ or globalised approach (Egghe & Rousseau, 1996). Opinions differ as to which approach is better suited or more appropriate for the evaluation of academic entities (Egghe & Rousseau, 1996; Lundberg, 2007; Moed, 2010; Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011a). In this paper we take a quantitative look at the differences between these two aggregation approaches.
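The two aggregation approaches differ only in where the division happens, yet they can diverge noticeably when expected scores vary across fields. A minimal sketch, using hypothetical actual and expected paper scores:

```python
def averaged_score(actual, expected):
    """'Average of ratios': mean of each paper's actual/expected ratio."""
    return sum(a / e for a, e in zip(actual, expected)) / len(actual)

def globalised_score(actual, expected):
    """'Ratio of averages': total actual score over total expected score."""
    return sum(actual) / sum(expected)

# Two papers: one in a low-citation field (expected score 2),
# one in a high-citation field (expected score 10).
actual = [4, 10]
expected = [2, 10]

print(averaged_score(actual, expected))    # (2.0 + 1.0) / 2 = 1.5
print(globalised_score(actual, expected))  # 14 / 12 ≈ 1.167
```

The averaged variant weights every paper equally, so the paper from the low-citation field dominates; the globalised variant implicitly weights papers by their expected scores, which is the crux of the disagreement in the literature.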
Since no gold standard for normalised metrics exists (Bornmann & Marx, 2018), we use a number of different paper-level impact metrics that use different normalisation strategies to overcome the bias introduced through varying citation potentials between research fields and time. For instance, we use mean-normalised citation scores where a paper's citation count is compared to the mean (expected) citation count of papers published in its field and in the same year (Lundberg, 2007; Radicchi, Fortunato, & Castellano, 2008). We also use a percentile metric where a paper is rated in terms of its percentile in the score distribution of papers in the same field and with the same publication year (Bornmann, Leydesdorff, & Mutz, 2013; Leydesdorff, Bornmann, Mutz, & Opthof, 2011). We also look at indirect metrics (Giuffrida, Abramo, & D’Angelo, 2018; Pinski & Narin, 1976) and a metric where a paper's co-cited papers (papers that are cited together with it) are used as the reference set to compute expected paper scores (Hutchins, Yuan, Anderson, & Santangelo, 2016).
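The two simplest normalisation strategies above can be sketched as follows. The percentile definition below is one common variant (the fraction of reference-set papers with strictly fewer citations); Bornmann et al. (2013) discuss several alternatives, so treat this as an illustration rather than the paper's exact formula:

```python
from bisect import bisect_left

def mean_normalised(citations, reference_counts):
    """Citation count divided by the mean (expected) count of the
    same-field, same-year reference set."""
    return citations / (sum(reference_counts) / len(reference_counts))

def percentile(citations, reference_counts):
    """Fraction of reference-set papers with a strictly lower count."""
    ranked = sorted(reference_counts)
    return bisect_left(ranked, citations) / len(ranked)

# Hypothetical citation counts of same-field, same-year papers.
ref = [0, 1, 2, 3, 4, 10]
print(mean_normalised(5, ref))  # 5 / (20/6) = 1.5
print(percentile(5, ref))       # 5 of 6 papers cited less ≈ 0.83
```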
An important task in scientometrics is to validate that metrics fulfil their intended purpose. According to Bornmann and Marx (2018), situations should be created or found in empirical research in which a metric can fail to achieve its purpose. A metric should only be regarded as provisionally valid if these situations could not be found or realised. For example, metrics that are intended to rate the quality of papers should be assessed by correlating them with peer assessments (Bornmann & Marx, 2015). This also applies to author-level evaluations. However, collecting direct peer-assessed test data is time consuming and expensive. We therefore use a proxy for this assessment which comprises test data based on awards and other recognitions that researchers have received for their outstanding contributions in their fields (Dunaiski, Geldenhuys, & Visser, 2018a; Dunaiski, Visser, & Geldenhuys, 2016; Fiala, Šubelj, Žitnik, & Bajec, 2015; Fiala, 2012; Fiala, Rousselot, & Ježek, 2008; Fiala & Tutoky, 2017; Gao, Wang, Li, Zhang, & Zeng, 2016; Nykl, Campr, & Ježek, 2015; Nykl, Ježek, Fiala, & Dostal, 2014). Specifically, we use selected researchers that have won prizes for their highly influential and long-lasting contributions and researchers that have been awarded the ACM fellowship for similar achievements.
We follow the appeal by Bornmann and Marx (2018) for continued scrutiny of current proposals and analyse the difference between the averaged and globalised variants of various paper-level metrics along two dimensions: (1) their fairness to rank authors across fields, and (2) their performance in ranking the well-established researchers comprising our test data set. We compare the overall bias and performance of the metrics but focus on the differences between the averaged and globalised variants for each paper-level metric.
We conduct all experiments on two publication databases. The first database is the ACM Digital Library (ACM, Inc, 2014) which provides a Computing Classification System (CCS) that consists of a library-like, hierarchical structure of concepts. Authors may assign their papers to one or more concepts in this classification hierarchy. We use the CCS to categorise papers and authors into subfields of the computer science discipline. The second database is the Microsoft Academic Graph (MAG) database (Microsoft, 2017). It is multi-disciplinary and papers are assigned to fields in a hierarchical structure based on keywords extracted from their texts. We use the top-level fields as paper categories which roughly capture the scientific disciplines such as ‘Mathematics’ or ‘Medicine’. Again, we categorise authors into disciplines based on their published work.
With this paper we make the following contributions:
- We analyse the averaged and globalised aggregation approaches on the author level using two different field classification schemes. The first is a categorisation in which the authors chose their papers’ categories (ACM database). The second is based on semantic information contained in titles and abstracts (MAG database).
- We consider a range of paper-level metrics and show that for some metrics the choice between the averaged and the globalised approach matters: it affects the author-level metric's field bias as well as its performance in identifying well-established researchers.
- We analyse the bias and performance of the variants over a range of citation window sizes (1–25 years). We find that the choice between the aggregation variants is largely insensitive to citation window size, whereas the differences between the metrics themselves change substantially with different citation windows.
In this paper we first provide the reader with background information about normalisation factors and focus on the arguments for or against the averaged and globalised approaches (Section 2). In Section 3, we describe the methodology of evaluating the metrics along the bias and performance dimensions. We present the results in Section 4, followed by a discussion of the results in Section 5.
Paper-level normalisation
One of the key principles of bibliometrics is that entities from different fields should not be compared directly based on total citation counts. This stems from the observation that citation densities (mean citation counts) vary between fields due to their different sizes and publication cultures (Lundberg, 2007; Radicchi et al., 2008). Citation densities may even vary between narrow subfields within the same discipline (van Leeuwen & Calero Medina, 2012). In addition, citation counts of
Publication databases
We use two publication databases for the experiments described in this paper. The first is a 2015 version of the ACM Digital Library (ACM, Inc, 2014). It contains papers published up to March 2015 in periodicals and proceedings from the field of computer science. The ACM uses a categorisation scheme called the Computing Classification System (CCS), in which each paper is associated with one or more concepts organised in a poly-hierarchical structure (ACM, Inc, 2017b).
Results
We use the different paper-level metrics discussed in Section 2.3 and aggregate them to the author level by using the three aggregation approaches discussed in Section 2.1. Therefore, each metric has three variants: (1) the size-dependent (total) variant which is the sum of an author's paper scores, (2) the average of ratios (averaged) variant, and (3) the ratio of averages (globalised) variant.
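The three variants can be computed side by side from an author's actual and expected paper scores; the values below are hypothetical:

```python
def author_variants(actual, expected):
    """Aggregate an author's normalised paper scores into the
    three author-level variants discussed in the text."""
    ratios = [a / e for a, e in zip(actual, expected)]
    total = sum(ratios)                       # size-dependent (sum of scores)
    averaged = total / len(ratios)            # average of ratios
    globalised = sum(actual) / sum(expected)  # ratio of averages
    return total, averaged, globalised

print(author_variants([4, 10], [2, 10]))  # (3.0, 1.5, 1.1666666666666667)
```

Note that the total variant grows with output size, while the other two are size-independent; this is why the paper treats them as distinct variants of each metric.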
We evaluate the author metrics on their field bias (Section 4.1) and ranking performance (Section
Discussion
The Abramo method. As mentioned before, comparing the results of the Abramo method to citation counts may yield some insight into how the indirect impact can influence the performance and bias of author scores. The Abramo method defines a paper's score as its citation count C plus the score obtained from indirect citations which ranges between 0 and C. Therefore, the actual score of a paper ranges between C and 2C, where 2C is achieved if all citing papers are themselves the most cited papers
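The passage above only fixes the bounds of the Abramo-style score (between C and 2C). The weighting below is a hypothetical stand-in chosen to respect exactly those bounds, not the method's actual recursive computation: each citing paper contributes an indirect weight in [0, 1] given by its citation count relative to the most cited paper.

```python
def abramo_like_score(citer_counts, max_citations):
    """Hypothetical sketch: direct citations C plus an indirect term
    in [0, C], so the result lies between C and 2C."""
    C = len(citer_counts)  # direct citation count
    indirect = sum(c / max_citations for c in citer_counts)
    return C + indirect

# Three citing papers with 0, 5, and 10 citations;
# the most cited paper in the corpus has 10 citations.
print(abramo_like_score([0, 5, 10], 10))  # 3 + 1.5 = 4.5
```

The upper bound 2C is reached only when every citing paper is itself maximally cited, matching the description in the text.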
Threats to validity and future work
In Section 4.1 we computed the bias of metrics for differently sized citation windows t. For this computation we created the null models based on all authors that have received a score at time t1 + t, where t1 is the year in which an author first published a paper. The score of an author is therefore based on a citation graph that contains all papers and citations up to the year t1 + t. Since t1 varies among authors, the author scores at time t are based on citation graphs with different time
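The per-author windowing described above can be sketched as follows, with hypothetical paper identifiers and years; each author's score is computed on the citation graph truncated at t1 + t:

```python
def author_window_graph(papers, citations, first_year, t):
    """Restrict a citation graph to papers published up to t1 + t.

    papers: {paper_id: publication_year}
    citations: list of (citing_id, cited_id) pairs
    first_year: t1, the year of the author's first paper
    """
    cutoff = first_year + t
    visible = sorted(p for p, year in papers.items() if year <= cutoff)
    edges = [(a, b) for a, b in citations if a in visible and b in visible]
    return visible, edges

papers = {"p1": 2000, "p2": 2003, "p3": 2010}
cites = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
print(author_window_graph(papers, cites, first_year=2000, t=5))
# (['p1', 'p2'], [('p2', 'p1')])
```

Because first_year differs per author, two authors evaluated at the same career age t see differently truncated graphs, which is the threat to validity discussed here.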
Conclusion
In this paper, we compared the averaged (average of ratios) and the globalised (ratio of averages) aggregation approaches on the author level. We used different time- and field-normalised paper-level metrics which we, once aggregated to the author level, evaluated in terms of field bias and performance. We evaluated performance based on how well the metrics rank the authors of a test data set comprising well-established researchers who have received accolades for their impactful and
Author contributions
Marcel Dunaiski: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.
Jaco Geldenhuys: Other contribution.
References (60)
- The use of percentiles and percentile rank classes in the analysis of bibliometric data: Opportunities and limits. Journal of Informetrics (2013)
- Methods for the generation of normalized citation impact scores in bibliometrics: Which method best reflects the judgements of experts? Journal of Informetrics (2015)
- Critical rationalism and the search for standard (field-normalized) indicators in bibliometrics. Journal of Informetrics (2018)
- The anatomy of a large-scale hypertextual web search engine. Proceedings of the Seventh International Conference on World Wide Web (1998)
- Finding scientific gems with Google's PageRank algorithm. Journal of Informetrics (2007)
- The effects and their stability of field normalization baseline on relative performance with respect to citation impact: A case study of 20 natural science departments. Journal of Informetrics (2011)
- Author ranking evaluation at scale. Journal of Informetrics (2018)
- How to evaluate rankings of academic entities using test data? Journal of Informetrics (2018)
- Evaluating paper and author ranking algorithms using impact and contribution awards. Journal of Informetrics (2016)
- Time-aware PageRank for bibliographic networks. Journal of Informetrics (2012)
- Do PageRank-based author rankings outperform simple citation counts? Journal of Informetrics
- PageRank-based prediction of award-winning researchers and the impact of citations. Journal of Informetrics
- Averages of ratios vs. ratios of averages: An empirical analysis of four levels of aggregation. Journal of Informetrics
- Co-authorship networks in the digital library research community. Information Processing & Management
- Lifting the crown-citation z-score. Journal of Informetrics
- Identification of milestone papers through time-balanced network centrality. Journal of Informetrics
- CWTS crown indicator measures citation impact of a research group's publication oeuvre. Journal of Informetrics
- Author ranking based on personalized PageRank. Journal of Informetrics
- PageRank variants in the evaluation of citation networks. Journal of Informetrics
- Caveats for the journal and field normalizations in the CWTS (“Leiden”) evaluations of research performance. Journal of Informetrics
- Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing & Management
- Testing the fairness of citation indicators for comparison across scientific domains: The case of fractional citation counts. Journal of Informetrics
- Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics
- Quantifying the interdisciplinarity of scientific journals and fields. Journal of Informetrics
- Finding the easter eggs hidden by oneself: Why Radicchi and Castellano's (2012) fairness test for citation indicators is not fair. Journal of Informetrics
- Expected number of citations and the crown indicator. Journal of Informetrics
- Quantifying and suppressing ranking bias in a large citation network. Journal of Informetrics
- The case of scientometricians with the “absolute relative” impact indicator. Journal of Informetrics
- A review of the literature on citation impact indicators. Journal of Informetrics
- Towards a new crown indicator: Some theoretical considerations. Journal of Informetrics