Introduction

Assessment of the performance capacity of individual researchers is a critical part of personnel selection (Formann, 1992; Gärtner et al., 2022; Schönbrodt et al., 2022; Vinkler, 1995) and of promotion and tenure decisions in academia (Glover et al., 2012; Shu et al., 2020; Sonne et al., 2019), and it is taken into account in funding decisions (Bornmann et al., 2010; Mutz et al., 2017; van den Besselaar & Leydesdorff, 2009). Recently, Mutz and Daniel (2018) proposed the bibliometric quotient to estimate researcher performance capacity, understood as the competency of a researcher to write impactful scholarly papers (Sinatra et al., 2016). The bibliometric quotient is conceptualized as a latent variable measured by differently constructed bibliometric indicators (for latent variable modeling of bibliometric indicators in the spirit of classical test theory, see Forthmann et al., 2024; Mutz, 2022). In particular, they used and extended count data item-response theory (IRT) models based on the negative binomial distribution, borrowed from the psychometric literature (Hung, 2012).

Using IRT models for research assessment comes with several advantages, such as more realistic assumptions and the quantification of measurement precision (Mutz & Daniel, 2018). In addition, Forthmann and Doebler (2021) have shown that the Conway-Maxwell Poisson counts model (CMPCM; Forthmann et al., 2020a) is another IRT model that is useful for researcher performance assessment. Specifically, they reanalyzed Mutz and Daniel’s (2018) dataset of German social science researchers and found the reliability of researcher performance capacity estimates to be excellent (reliability = 0.973). Mutz and Daniel’s (2018) indicators include one pure measure of productivity, defined as the total number of published scholarly articles, and all other indicators potentially benefit from high productivity (annual productivity is also often measured by the number of papers published by a researcher in a year; Yair & Goldstein, 2020). This raises the question of the extent to which the resulting latent variable estimates are (mere) productivity measures such as the number of articles.

It is evident that a researcher must first publish their work in order for it to be cited or subjected to bibliometric evaluation (Helmreich et al., 1980). In the absence of any indications of productivity, there would be no information available for the assessment of individual researchers. This is further supported by the developmental psychological theory of productivity as a manifestation of a researcher’s creative potential (cf. Allison & Stewart, 1974), which can be defined as the total number of papers a scientist is capable of producing (Simonton, 1984, 1988, 2004). A logical antecedent of creative potential is cognitive ability (Rodgers & Maranto, 1989), which in combination with sustained interest and training in a field results in the expertise required for productivity (Dumas et al., 2024; Rodgers & Maranto, 1989; Simonton, 1984). According to Simonton’s (1984) two-step cognitive process model of productivity, individual differences in creative potential are assumed to explain individual differences in productivity. Empirical findings were in accordance with this idea (Simonton, 1984). However, organizational factors such as a faculty’s research infrastructure are also highly critical for research productivity (Way et al., 2019). In addition, producing something (e.g., publishing a scientific article) is necessary to exert intellectual influence on one's peers (i.e., impact) and for career achievement (Helmreich et al., 1980; Simonton, 1988). Thus, any article that is assessed for a researcher's impact provides information for impact assessment and should be counted toward that researcher's productivity (i.e., productivity in this paper is limited to the products that count for bibliometric researcher assessment). Overall, this provides a theoretical basis for the primary aim of the current paper, which is to explain reliable individual differences in a bibliometric measure by individual differences in productivity.

Research productivity has further been proposed as a target dimension of research evaluation itself (Abramo & D’Angelo, 2014; Waltman et al., 2016). Indeed, rankings of productivity in the field of psychology have been published for over a century (Cattell, 1903). Yet, evaluations of research performance often move beyond mere productivity. For example, in patentometrics researchers have called for a multifaceted assessment of inventive performance capacity (Caviggioli & Forthmann, 2022; Caviggioli et al., 2020; Lanjouw & Schankerman, 2004), and it is typically challenging to disentangle aspects of patent quality and productivity at the individual level (Caviggioli & Forthmann, 2022). The main challenge here is to reasonably handle the size dependence of indicators that aim to aggregate patent quality dimensions for individual inventors. Analogously, bibliometric indicators used for the research performance evaluation of individual scientists, such as citation counts, are affected by size dependence (Forthmann & Szardenings, 2022; Forthmann et al., 2020a; Prathap, 2018), which explains the strong positive correlations found in the literature between citation counts and number of publications at the individual level (Cole & Cole, 1967; Davis, 1987; Feist, 1997; Forthmann et al., 2021). Hence, the size dependence of indicators used for estimating researcher performance capacity again emphasizes the potential role of productivity as an explanatory variable of reliable individual differences between researchers.

In this work, we seek to integrate the important role of productivity into IRT modeling of researcher performance capacity (Forthmann & Doebler, 2021; Mutz & Daniel, 2018) by comparing the reliability of researcher performance capacity estimates across models: Model 1, an IRT model in which productivity is excluded as an item (i.e., a baseline model affected by size dependence); Model 2, an IRT model in which raw productivity is included as a person covariate (i.e., a naïve model that one might consider for comprehensiveness); Model 3, an IRT model in which log-transformed productivity (cf. Abramo et al., 2010; Sinatra et al., 2016) is included as a person covariate; and Model 4, a variant of Model 3 in which log-transformed productivity is employed as an offset. The model including log-transformed productivity as an offset is well motivated because it incorporates the idea underlying size-independent indicators (i.e., using a ratio) into IRT modeling of researcher performance capacity. Since offsets are fixed additive terms, one can view Model 4 as a log-linear model with a fixed coefficient of 1 for log-productivity as a person covariate, that is, a special case of Model 3. It is an empirical question whether this is a reasonable choice. Hence, we further examined Model 3, in which this coefficient is freely estimated. In addition, competing models were compared by means of the Bayesian information criterion (BIC; Schwarz, 1978; note that Model 2 and Model 3 are not nested). Finally, we further explored a model with item-specific effects of productivity (Model 5) and a model that additionally includes academic age (Model 6), a person-covariate relevant to a researcher’s productivity (e.g., Simonton, 1984).

Item response theory count models with productivity as a person covariate

Research assessment has commonly been based on single indicators such as the h index (Hirsch, 2005). This practice implies a deterministic use of bibliometric indicators (or any other indicator of research performance), which is unrealistic and can be overcome by probabilistic conceptualizations (Glänzel & Moed, 2013). Indeed, it seems unreasonable to equate researcher performance capacity with one (deceptively exact) value of a bibliometric indicator. This argument is particularly pressing when research performance is evaluated at the level of individual researchers, where information is much less aggregated (i.e., less free of measurement error) than in research evaluation at the level of institutions (cf. Smith, 1981; van Raan, 2005), for example. It is much more realistic to conceptualize researcher performance capacity as an unobserved latent variable (i.e., a variable that cannot be measured directly) that is probabilistically related to observed indicators: The higher the researcher performance capacity, the more likely a high indicator score will be. This is exactly how IRT conceptualizes the relationship between observed scores and single items (in the context of researcher assessment this refers to different indicators), item parameters such as difficulty (i.e., on some bibliometric indicators it will be easier to obtain high scores than on others), and researcher performance capacity. Beyond this, Mutz and Daniel (2018) argued that count data IRT models in particular may overcome the following issues associated with common research assessment practice: the lack of a conceptualization and quantification of measurement precision (i.e., reliability), unreasonable modeling of count data indicators, the neglect of evidence for reducing the dimensionality of indicators, and the incomparability of indicators across disciplines. Perhaps early IRT applications to bibliometric indicators (Alvarez & Pulgarín, 1996) did not receive much attention in the psychometric literature (Glänzel & Moed, 2013) because not all of these issues were convincingly addressed.

Accounting for productivity within IRT count models

The conditional expected value in IRT count models for bibliometric indicator i and researcher j is modeled as follows (cf. Forthmann & Doebler, 2021):

$${\mu }_{ij}={\sigma }_{i}{\varepsilon }_{j},$$
(1)

with indicator easiness parameter σi (higher values imply higher scores), and researcher performance capacity parameter εj (higher values imply higher scores).

Then, the probability of observing score yij is obtained from mean parameterized count data distributions such as the Poisson distribution, the negative binomial distribution, or the Conway-Maxwell Poisson (CMP) distribution (cf. Forthmann & Doebler, 2021). For example, the mean parameterized CMP distribution (Huang, 2017; for other parameterizations see, e.g., Guikema & Goffelt, 2008; Sellers & Shmueli, 2010) is given by

$$P\left({Y}_{ij}={y}_{ij}|{\sigma }_{i},{\nu }_{i},{\varepsilon }_{j}\right)=\frac{{\lambda \left({\mu }_{ij},{\nu }_{i}\right)}^{{y}_{ij}}}{{\left({y}_{ij}!\right)}^{{\nu }_{i}}}\frac{1}{Z\left(\lambda \left({\mu }_{ij},{\nu }_{i}\right),{\nu }_{i}\right)},$$
(2)

with indicator-specific dispersion parameter νi, rate parameter \(\lambda \left({\mu }_{ij},{\nu }_{i}\right)\) computed from νi and μij, and normalization constant \(Z\left(\lambda \left({\mu }_{ij},{\nu }_{i}\right),{\nu }_{i}\right)\). Notably, Eq. 2 reduces to the probability mass function of the Poisson distribution when νi = 1 (cf. Huang, 2017). In other words, Rasch’s Poisson count model (Rasch, 1960) is a special case of the CMPCM. In addition, the data are overdispersed (i.e., variance is greater than the mean) when νi < 1 and underdispersed (i.e., variance is smaller than the mean) when νi > 1.
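
To make Eq. 2 concrete, the following R sketch (not the authors' code) evaluates the mean parameterized CMP probability mass function by truncating the infinite series and solving numerically for the rate parameter; the truncation bound max_y is an illustrative choice that should exceed the plausible range of counts.

```r
# Sketch of the mean parameterized CMP pmf in Eq. 2 (truncated-series approximation).
cmp_pmf <- function(y, mu, nu, max_y = 10000) {
  ys <- 0:max_y
  # Mean of the CMP distribution for a given rate lambda (truncated series).
  cmp_mean <- function(lambda) {
    logw <- ys * log(lambda) - nu * lfactorial(ys)  # log of unnormalized weights
    w <- exp(logw - max(logw))                      # rescale for numerical stability
    sum(ys * w) / sum(w)
  }
  # Solve for lambda(mu, nu) such that the distribution has expected value mu.
  lambda <- uniroot(function(l) cmp_mean(l) - mu, lower = 1e-8, upper = 1e8)$root
  logw <- ys * log(lambda) - nu * lfactorial(ys)
  logZ <- max(logw) + log(sum(exp(logw - max(logw))))  # log normalization constant Z
  exp(y * log(lambda) - nu * lfactorial(y) - logZ)
}

# nu = 1 recovers the Poisson case (Rasch's Poisson count model):
cmp_pmf(3, mu = 2, nu = 1)  # approximately dpois(3, 2) = 0.180
```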

The reliability of researcher performance capacity parameter estimates in IRT count data models (and any other IRT model) can be obtained by means of marginal (empirical) reliability (Brown & Croudace, 2015). Empirical reliability estimates the squared correlation between the estimated and the true researcher performance capacity (the formula is shown below in the Methods section). Thus, higher values imply that estimates are more closely aligned with their true values, with a value of one indicating perfect reliability. More technical details on the CMP distribution and the CMPCM can be found in Huang (2017) and Forthmann et al. (2020a).

It is straightforward to extend Eq. 1 with productivity Tj (e.g., the number of published papers of researcher j) as a person covariate, as in Poisson regression,

$${\mu }_{ij}={\sigma }_{i}{\varepsilon }_{j}{\text{exp}\left({T}_{j}\right)}^{\gamma },$$
(3)

with regression coefficient γ, but we prefer an alternative for several reasons: First, productivity as an item instead of a person-covariate would be modeled via a log-link function (Forthmann & Doebler, 2021; Mutz & Daniel, 2018). Second, productivity distributions of individual researchers are highly skewed and, thus, exp(Tj) would become very large for highly productive scientists. The skew has been documented since Lotka’s (1926) seminal work on the frequency distribution of scientific productivity (cf. Simonton, 1988). For that reason, researchers have often chosen a log-normal distribution for productivity (e.g., Sinatra et al., 2016), which also implies a log-transformation of productivity at the individual level (cf. Abramo et al., 2010). The Matthew effect (Allison & Stewart, 1974; Feichtinger et al., 2021) is one explanation discussed in the literature for disproportionally large individual productivity, further justifying curbing its influence by a log-transformation. Consequently, one might consider the following alternative model to account for productivity:

$${\mu }_{ij}={\sigma }_{i}{\varepsilon }_{j}{{T}_{j}}^{\gamma }.$$
(4)

The interpretation of model parameters in Eq. 4 clearly simplifies when γ = 1. In this situation, the researcher performance capacity parameter εj is directly proportional to a researcher’s expected contribution beyond productivity, i.e., \({\varepsilon }_{j}={\mu }_{ij}/({T}_{j}{\sigma }_{i})\) (cf. Beisemann et al., 2020; Böhning et al., 2015), and in the (count) regression literature, Tj is called an offset. Thus, controlling for researcher productivity in this way would be well in accordance with the logic underlying the journal impact factor (e.g., Garfield, 1972) and size-independent indicators in general, which are defined as ratios (i.e., the proportion of publications with a certain property; cf. https://www.leidenranking.com/information/indicators#size-independent). Analogously, an offset has been used to incorporate field normalization into a multiple-membership Poisson model for impact at the individual level (Mutz & Daniel, 2019).
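
Rewriting Eq. 4 on the log scale used for estimation (with \({\beta }_{i}=\text{log}\,{\sigma }_{i}\) and \({\theta }_{j}=\text{log}\,{\varepsilon }_{j}\), as introduced in the Methods section) makes this ratio logic explicit:

$$\text{log}\,{\mu }_{ij}={\beta }_{i}+{\theta }_{j}+\gamma \,\text{log}\,{T}_{j},\qquad \gamma =1\;\Rightarrow \;\text{log}\left(\frac{{\mu }_{ij}}{{T}_{j}}\right)={\beta }_{i}+{\theta }_{j}.$$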

The models in Eqs. 3 and 4 can further be extended to include item-specific (log) person covariate effects. This can be modelled as

$${\mu }_{ij}={\sigma }_{i}{\varepsilon }_{j}{\text{exp}\left({T}_{j}\right)}^{\gamma +{\gamma }_{i}},$$
(5)

for an extension of Eq. 3 and

$${\mu }_{ij}={\sigma }_{i}{\varepsilon }_{j}{{T}_{j}}^{\gamma +{\gamma }_{i}}.$$
(6)

for an extension of Eq. 4. It can be argued that the influence of productivity may vary between indicators, which motivates these item-specific effects. For identification purposes, we need to impose the constraint that \({\sum }_{i=1}^{I}{\gamma }_{i}=0\), where \(I\) is the number of indicators.

The present research

In this work we aim to compare different models that explicitly incorporate researcher productivity into latent variable models for the estimation of researcher performance capacity. We examine the extent to which these different approaches affect the reliability of researcher performance capacity estimates by leveraging count data IRT models, as this class of models has recently been shown to work well with bibliometric indicators (Forthmann & Doebler, 2021; Mutz & Daniel, 2018). Specifically, we start from an IRT model with item-specific dispersion in which productivity is not used as an item. Next, we aim at explicit modeling of productivity within the extended IRT framework outlined above (i.e., models based on Eqs. 3 and 4). The model including raw productivity (Eq. 3) can be considered a naïve guess at how productivity should be controlled; this model was mainly included for the sake of comprehensiveness. With respect to Eq. 4 we consider two variants: a model in which γ is fixed to a value of one (productivity as an offset), and a model in which γ is freely estimated. The offset model nicely transfers the idea of how size-independent indicators are constructed into IRT modeling of researcher performance capacity (cf. Beisemann et al., 2020; Böhning et al., 2015; Mutz & Daniel, 2019). Ultimately, it is an empirical question whether fixing the coefficient for log-productivity to a value of one is justified for a given dataset. Hence, the model in which the coefficient is freely estimated (Eq. 4) should be explored to put the “ratio” model to the test. Based on model comparisons using the BIC (Schwarz, 1978), we extend the best-fitting model among the models that account for productivity to include item-specific productivity effects (i.e., a model based on either Eq. 5 or Eq. 6). Finally, we evaluate whether academic age (Costas et al., 2010; Kwiek & Roszka, 2022) as an additional person-covariate improves model fit beyond productivity. To ensure the generalizability of the findings, we evaluated the proposed competing models on two datasets consisting of social science researchers.

Method

Datasets

First, we reanalyzed the dataset used by Mutz and Daniel (2018), which is publicly available (https://doi.org/10.3929/ETHZ-B-000271425). This dataset was also reanalyzed by Forthmann and Doebler (2021). The dataset includes bibliometric indicators of a total of N = 254 German social scientists. The available bibliometric indicators for researcher performance capacity estimation were (a) TOTCIT (total number of citations), (b) SHORTCIT (citations within 3 years), (c) TOP10% (number of papers from the top 10% in the scientific field), (d) PUBINT (number of collaborative papers with international co-authors), and (e) NUMCIT (number of cited papers). In addition, we used NUMPUB (total number of scholarly articles) as an indicator of productivity.

Second, a subset of N = 3956 social science researchers was drawn from an openly available dataset by Baas et al. (2021). Due to the presence of researchers with unusually high academic ages (i.e., values exceeding 100), we excluded all researchers with an academic age above 70. This threshold corresponds to an approximate biological age of 100 years (Kwiek & Roszka, 2022), which may plausibly be reached in only a few exceptional cases. Consequently, 16 researchers were excluded, representing 0.4% of the social science researcher sample, and the final dataset for analysis comprised N = 3940 researchers. Six of the available bibliometric indicators in this dataset are included in the composite indicator proposed by Ioannidis et al. (2016). Five of these indicators are count data and were included in our analysis (the indicator HM was omitted): (a) NC9620 (total number of citations received in the period from 1996 to 2020), (b) H20 (the h-index at the end of 2020), (c) NCS (total citations received for single-authored papers), (d) NCSF (total citations received for single-authored and first-authored papers), and (e) NCSFL (total citations received for single-authored, first-authored, and last-authored papers). In addition, we used NP6020 (total number of scholarly articles published within the years from 1960 to 2020) as an indicator of productivity.

Model estimation

All models were fitted with the glmmTMB package (Brooks et al., 2017) using the statistical software R (R Core Team, 2023). All code and data are openly available in an OSF repository (https://osf.io/2afdw/).

The mean parameterized CMP distribution and the mean parameterized negative binomial distribution are both implemented in glmmTMB with a log link for the expected value. The dispersion model for the CMPCM is fitted with the log of the inverse as link function for the dispersion parameters, i.e., τi = log(1/νi). The negative binomial model uses a log link for the dispersion parameters, i.e., τi = log(φi). All other model parameters in both models are estimated with the transformations θj = log(εj) and βi = log(σi). In addition, for the CMPCM, a τi estimate of zero corresponds to the Poisson distribution, whereas τi < 0 and τi > 0 imply underdispersion and overdispersion, respectively. For the negative binomial model, only item-specific overdispersion is modeled. Note that γ does not need to be transformed in either model.

For both models, the researcher performance capacity parameters θj are assumed to follow a normal distribution with mean zero (for model identification purposes) and standard deviation σθ. Thus, researcher performance capacity is modeled as a random effect in this work. Fixing average researcher performance capacity to zero has further implications for the interpretation of item easiness parameter estimates: item easiness can be understood as the expected value of an item for researchers with average capacity. All other model parameters, such as item parameters and regression coefficients for person-covariates, are fixed effects.

As a baseline model, an IRT model with item-specific dispersion parameters (Model 1) was fitted for the first dataset based on the indicators TOTCIT, SHORTCIT, TOP10%, PUBINT, and NUMCIT, whereas for the second dataset the indicators were NC9620, H20, NCS, NCSF, and NCSFL. Productivity was then simply added as a person-covariate (De Boeck et al., 2011) to fit the model implied by Eq. 3 (Model 2), while for the model implied by Eq. 4 log-transformed productivity was added as a person covariate (Model 3). For the special case with γ = 1, log-transformed productivity was added as an offset to the model (Model 4).

For the first dataset, we used the CMPCM because it has been shown in previous work to fit the data better than a negative binomial model (Forthmann & Doebler, 2021). Initially, we planned to simply use the CMPCM for the second dataset as well, due to the model's flexibility in modeling different dispersion patterns. However, when we attempted to estimate the models for the dataset, the model estimation did not complete after seven days of running on a computer cluster. We explored the data further and observed that significant overdispersion was present for all bibliometric indicators and decided to use a computationally less intensive negative binomial model with item-specific dispersion parameters for the second dataset.
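
As a minimal illustration (not the code from the OSF repository), Models 1 to 4 can be specified in glmmTMB roughly as follows, assuming long-format data with one row per researcher-indicator pair and illustrative column names score (count), item (indicator factor), id (researcher), and prod (productivity):

```r
library(glmmTMB)

# Model 1: item-specific easiness (fixed item effects) and dispersion, no productivity.
m1 <- glmmTMB(score ~ 0 + item + (1 | id),
              dispformula = ~ 0 + item,
              family = compois, data = dat)

# Model 2: raw productivity as a person covariate (Eq. 3).
m2 <- update(m1, . ~ . + prod)

# Model 3: log-transformed productivity as a person covariate (Eq. 4, gamma free).
m3 <- update(m1, . ~ . + log(prod))

# Model 4: log-transformed productivity as an offset (Eq. 4 with gamma fixed to 1).
m4 <- update(m1, . ~ . + offset(log(prod)))

# For the second dataset, family = nbinom2 (negative binomial) replaces compois.
```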

In the present study, we did not employ the well-established approach of likelihood ratio tests for model comparisons, even though previous research utilizing IRT count data models such as the CMPCM or a negative binomial count model has employed this approach (Forthmann & Doebler, 2021; Forthmann et al., 2020a). Likelihood ratio testing is only applicable when the models under comparison are nested, that is, when one of the two models is derived from the other by constraining at least one of its parameters to a specified value. For example, Model 1 is derived from Model 3 by fixing the parameter γ to a value of zero. However, Model 2 and Model 3 cannot be converted into one another by any such constraints and are thus non-nested. Consequently, given that some of the models were not nested, we employed the Bayesian information criterion (BIC) to evaluate relative model fit in lieu of likelihood ratio testing. Lower values imply better fit to the data, while at the same time model parsimony is taken into account. In addition, we used Raftery’s (1995) heuristic for the interpretation of differences in BIC (0–2: weak evidence; 2–6: positive evidence; 6–10: strong evidence; > 10: very strong evidence).
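
Continuing the illustrative fits above, the BIC comparison can then be carried out with base R; differences greater than 10 correspond to very strong evidence under Raftery's (1995) heuristic:

```r
# Compare the (partly non-nested) models; lower BIC indicates better fit.
bics <- BIC(m1, m2, m3, m4)
bics$BIC - min(bics$BIC)
```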

After comparing Models 1 to 4 using the BIC, we selected the best-fitting of the four to be extended with item-specific (log) productivity effects (Eq. 5 or 6). To implement the identification constraint for this model (Model 5), we used a sum-to-zero constraint (i.e., sum coding). Due to the sum coding, we did not directly obtain the item-specific (log) productivity effect estimate for the reference item in Model 5, but we were able to compute it as \(-{\sum }_{i=2}^{I}\widehat{{\gamma }_{i}}\). The corresponding standard error was computed using the delta method as implemented in the msm package (Jackson, 2011). To the best-fitting model out of Models 1 to 5, we added researchers' academic age as another person covariate to control for it (Model 6). Finally, the marginal (empirical) reliability of the θj estimates was estimated for each model by the same approach used in previous work (Brown & Croudace, 2015; Forthmann & Doebler, 2021). Empirical reliability is estimated by

$$\text{Rel}\left(\theta \right)=1-{\overline{SE}}_{\theta }^{2}/{\widehat{\sigma }}_{\theta }^{2},$$
(7)

with \({\widehat{\sigma }}_{\theta }^{2}\) being the estimated variance of researcher performance capacity and \({\overline{SE}}_{\theta }^{2}\) being the average squared sampling error of the θj estimates.
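
A sketch of Eq. 7 for a fitted glmmTMB model, assuming that as.data.frame(ranef()) returns the conditional standard deviations (condsd) of the researcher effects; m1 and the grouping variable id refer to the illustrative fit above.

```r
empirical_reliability <- function(fit, group = "id") {
  re <- as.data.frame(ranef(fit))                    # conditional modes and their SDs
  se2_bar <- mean(re$condsd[re$grpvar == group]^2)   # average squared sampling error
  sigma2_theta <- VarCorr(fit)$cond[[group]][1]      # estimated variance of theta
  1 - se2_bar / sigma2_theta
}

empirical_reliability(m1)
```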

Given that the comparison of empirical reliability estimates across different models was the focus of this work, we finally aimed at quantifying the uncertainty associated with these estimates. Specifically, we constructed a bootstrapped 95%-CI for the empirical reliability estimates. First, we resampled the squared sampling errors \({SE}_{{\theta }_{j}}^{2}\) and obtained \({\overline{SE}}_{\theta }^{2}\) for each of the B samples. Second, the distribution of θj estimates is assumed to be normal and, hence, we obtained B random samples of \({\widehat{\sigma }}_{\theta }^{2}\) by \({X\widehat{\sigma }}_{\theta }^{2}/(n-1)\), with X drawn from a \({\chi }^{2}\)-distribution with df = n − 1 (Ahn & Fessler, 2003; Holling & Gediga, 2013). Finally, we calculated the 0.025-quantile and the 0.975-quantile of the sampling distribution of empirical reliability obtained from the first two steps. We used B = 100,000 for the bootstrapping procedures in this work.
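
A sketch of this bootstrap procedure, again for the illustrative model m1; the seed and object names are assumptions rather than the settings used in the OSF code.

```r
set.seed(1)
B <- 100000
re <- as.data.frame(ranef(m1))
se2 <- re$condsd[re$grpvar == "id"]^2   # squared sampling errors of the theta estimates
n <- length(se2)
sigma2_hat <- VarCorr(m1)$cond$id[1]    # estimated variance of theta

# Step 1: resample the squared sampling errors and average within each bootstrap sample.
se2_bar_b <- replicate(B, mean(sample(se2, replace = TRUE)))
# Step 2: draw B values of the latent variance via the chi-square distribution.
sigma2_b <- rchisq(B, df = n - 1) * sigma2_hat / (n - 1)
# Step 3: percentile interval of the resulting empirical reliability values.
rel_b <- 1 - se2_bar_b / sigma2_b
quantile(rel_b, probs = c(0.025, 0.975))
```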

Notably, this approach does not take into account the potential dependence between estimates of \({\overline{SE}}_{\theta }^{2}\) and \({\widehat{\sigma }}_{\theta }^{2}\). Thus, we expect our approach to be conservative (i.e., confidence intervals are wider as compared to an approach that takes the dependence into account) and any inferences based on the intervals should be treated with caution. However, approaches that explicitly model the potential dependence here such as case resampling combined with model refitting and fully non-parametric bootstrapping (Myszkowski & Storme, 2018; Storme et al., 2019) would be computationally too demanding for the CMPCM which involves approximation of an infinite series. This assertion was tested on a Dell Precision 3551 laptop with the Windows 11 operating system and an x86-64 processor. The time required for fitting Model 1 (i.e., the CMPCM with item-specific dispersion parameters) was determined, as well as the time necessary for obtaining the bootstrap CIs as they are reported in this work. The time required for fitting Model 1 was 3.5 h, while the time required for obtaining the bootstrap CI based on 100,000 bootstrap samples was 2.81 s. Consequently, it would take more than 20 weeks to obtain a non-parametric bootstrap CI by means of case resampling and only 1000 bootstrap samples, for example. Hence, we argue that having an option for quantifying uncertainty (even if that option is suboptimal) is better than having no statistical inference at all. However, because the approach is conservative we cautiously interpret non-overlapping intervals as being indicative of substantial reliability differences between models (Cumming, 2009). For the negative binomial models estimated for the second dataset, we simply employed the same approach for estimating confidence intervals.

Results

First dataset

The estimated model parameters and reliability estimates for researcher performance capacity for all models are depicted in Table 1. Recall that Model 1 does not include productivity as an item for researcher performance capacity estimation. In contrast to earlier models including productivity (Forthmann & Doebler, 2021), estimates of indicator easiness were slightly smaller, yet the order of estimates was highly similar, with TOTCIT being the easiest and TOP10% being the most difficult indicator. The dispersion model, however, was quite different. For example, SHORTCIT displayed an estimate of τ = − 0.08 close to the Poisson case of τ = 0 (it displayed strong overdispersion when productivity was also included as an item), and NUMCIT displayed overdispersion instead of underdispersion. Yet, the reliability of researcher performance capacity estimates was nearly unaffected by omitting productivity as an item. Model 1 resulted in a reliability of 0.96, 95%-CI: [0.95, 0.97], which is negligibly different from the estimate of 0.97, 95%-CI: [0.96, 0.98], reported by Forthmann and Doebler (2021).

Table 1 CMP model estimates for the first dataset

According to BIC differences (see Table 1), we found very strong evidence for modeling productivity as a person covariate (i.e., the BIC of Model 1 exceeded that of each of the other models by more than 10). In addition, Model 2 was clearly outperformed by Models 3 and 4, with Model 3 being the best-fitting model among the simpler productivity-control models. In the same vein, we found the empirical reliability confidence intervals for Models 2, 3, and 4 to be non-overlapping with the reliability confidence interval for Model 1. This indicated a substantial decrease in reliability when productivity is controlled. We further observed that decreases in indicator easiness parameters, estimates of researcher performance capacity variance, and reliability of researcher performance capacity estimates were all associated with increasing model fit. Model 2 still displayed excellent reliability (0.93), Model 4 displayed a level of reliability (0.60) that would be considered only close to acceptable for research purposes (and still unacceptable for high-stakes assessment contexts; e.g., Ferrando & Lorenzo-Seva, 2018), and the reliability for the best-fitting Model 3 (0.47) was far below any acceptable standards. Finally, we observed that Models 3 and 4 displayed highly similar findings for the dispersion parameters. The pattern closely resembled the estimates reported by Forthmann and Doebler (2021). Specifically, SHORTCIT displayed strong overdispersion and NUMCIT displayed underdispersion in these models.

Next, we compared Model 3, with log-productivity as a person covariate, to Model 5, which incorporated item-specific deviations from a general effect of log-productivity. We found very strong evidence that Model 5 further improved model fit (the BIC decreased clearly by more than 10; cf. Table 1). Specifically, we found that for PUBINT and NUMCIT the effect of log-productivity was significantly smaller (see Table 1), whereas for SHORTCIT and TOP10% the relationship was negligibly different from the average effect. In addition, the item-specific effect estimate for log-productivity for TOTCIT in Model 5 was computed to be \({\widehat{\gamma }}_{1}=0.37\left(0.04\right), z=10.01, p<.001\). Thus, for TOTCIT the effect of log-productivity was significantly larger than the average effect. The average effect of log-productivity (1.32) increased as compared to the coefficients in Model 3 (1.17) and Model 4 (fixed to 1.00), whereas item easiness parameters decreased further in Model 5 (vs. the other models in which log-productivity was included). This was particularly the case for TOTCIT, which was the indicator with the highest item-specific influence of log-productivity, and much less pronounced for PUBINT and NUMCIT, which were significantly less influenced by log-productivity. We further observed that overdispersion for TOTCIT and SHORTCIT was clearly less strong than in the other models including log-productivity as a person-covariate. Finally, we observed that the reliability of researcher performance capacity estimates was larger for Model 5 (0.56) than for Model 3 (0.47). Yet, the confidence intervals for both reliability estimates were partially overlapping (see Table 1) and, hence, the observed difference should be treated with caution.

Finally, adding academic age as another person-covariate (Model 6) did not substantially improve model fit over Model 5 (the BICs of both models differed only in the decimals). The coefficient for academic age, however, was positive and statistically significant (see Table 1), which implies a slight effect of academic age on research performance capacity beyond the item-specific influences of log-productivity modeled in Model 5. The estimated coefficient of 0.01 implies that 10 additional academic years change the expected count by a factor of 1.11 (everything else being constant). The item-specific effect of log-productivity for TOTCIT in Model 6 was computed to be \({\widehat{\gamma }}_{1}=0.38\left(0.04\right),z=10.07,p<.001\). Hence, all model parameters (see also Table 1) as well as the reliability of researcher performance capacity estimates (and their confidence intervals) were highly comparable between Model 5 and Model 6. Notably, academic age and productivity were moderately positively correlated, r = 0.39, p < 0.001, 95%-CI: [0.28, 0.49].
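
The factor follows directly from the log link:

$$\text{exp}\left(10\times 0.01\right)={e}^{0.1}\approx 1.11.$$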

Second dataset

The estimated model parameters and reliability estimates for researcher performance capacity for all negative binomial models for the second dataset are depicted in Table 2. In Model 1, the order of estimates of indicator easiness indicated that NC9620 was the easiest and H20 was the most difficult indicator. The dispersion model further indicated that NCSFL displayed the strongest level of overdispersion, while H20 was the least overdispersed indicator. Reliability of researcher performance capacity estimates was excellent, 0.98, 95%-CI: [0.98, 0.98].

Table 2 Negative binomial model estimates for the second dataset

Again, we found compelling evidence that modeling productivity as a person covariate improves model fit (i.e., the BIC of Model 1 exceeded that of each of the other models by more than 10). Additionally, Model 2 and Model 3 demonstrated superior performance compared to Model 4, while Model 2 exhibited a slight advantage over Model 3. However, in contrast to the findings for the first dataset, the empirical reliability estimates for all models fitted to the second dataset differed only at the third decimal place. Therefore, the observed reduction in reliability when controlling for productivity appeared to be virtually inconsequential (all models exhibited excellent levels of reliability), although slight declines in the indicator easiness parameters, the estimates of the variance of researcher performance capacity, and the reliability of the researcher performance capacity estimates were all associated with increasing model fit. Finally, we observed that, with few exceptions, all models yielded highly similar results for the dispersion parameters.

Next, we compared Model 2 to Model 5, which incorporated item-specific deviations from a general effect of productivity. The results indicated that Model 5 significantly enhanced model fit, as evidenced by a reduction in BIC of more than 10 (see Table 2). Specifically, the effect of productivity was found to be significantly smaller for NCS and NCSF (see Table 2), whereas for NCSFL the relationship was significantly larger, and for H20 it was not statistically different from the average effect. Furthermore, the item-specific effect estimate for productivity for NC9620 in Model 5 was computed to be \({\widehat{\gamma }}_{1}=0.002 \left(0.000\right),z=30.64,p<.001\). Therefore, the effect of productivity was significantly larger for NC9620 than the average effect. The average effect of productivity (0.005) was slightly smaller than the coefficient in Model 2 (0.006). Additionally, the reliability of the researcher performance capacity estimates remained consistent to the third decimal place for Model 5 (0.978) and Model 2 (0.978).

Ultimately, incorporating academic age as an additional person-covariate (Model 6) led to a notable enhancement in model fit relative to Model 5. The coefficient for academic age was again positive and statistically significant (see Table 2), indicating that academic age has a slight effect on research performance capacity beyond the item-specific influences of productivity modeled in Model 5. Again, the estimated coefficient of 0.01 implies that 10 more academic years result in an expected count that changes by a factor of 1.11 (everything else being constant). The item-specific effect of productivity for NC9620 in Model 6 was computed to be \({\widehat{\gamma }}_{1}=0.002 \left(0.000\right),z=30.60,p<.001\). Therefore, all model parameters (see also Table 2) and the reliability of researcher performance capacity estimates (and their confidence intervals) were highly comparable between Model 5 and Model 6. Notably, in the second dataset, there was only a small correlation between academic age and productivity, r = 0.19, p < 0.001, 95%-CI: [0.16, 0.22].

Discussion

In this work we combined two lines of research in the scientometrics literature, namely research emphasizing the critical role of productivity for research assessment (Forthmann et al., 2020b; Prathap, 2018; Simonton, 2010) and work with a focus on IRT-based research evaluation (Alvarez & Pulgarín, 1996; Forthmann & Doebler, 2021; Mutz & Daniel, 2018). Specifically, we integrated productivity as a person-covariate into IRT count data models (the CMPCM and a negative binomial count model) and evaluated four different such approaches against a simple IRT model. Most critically, we were interested in the extent to which controlling for productivity affected the reliability of researcher performance capacity estimates. The evaluations were based on a dataset used in earlier work (Forthmann & Doebler, 2021; Mutz & Daniel, 2018) that considered productivity as an additional item rather than a person-covariate. Accordingly, the findings presented for this dataset can be interpreted in light of previous findings for the same dataset. Furthermore, the approach was also applied to a second dataset to assess the robustness of the findings obtained for the first dataset.

Overall, we found strong evidence that including productivity as a person-covariate improves model fit, while at the same time reliability (on average across models incorporating productivity) decreased along with increasing model fit. However, the decline in reliability observed in the first dataset was substantial and practically meaningful, whereas in the second dataset it was statistically discernible but practically negligible.

For the first dataset, we made two observations that potentially explain the decrease in reliability: (a) estimates of researcher performance capacity variance dropped close to zero for the best fitting models (these variance estimates dropped only slightly for the second dataset), and (b) the overall level of overdispersion present in the indicators was strongest for the models that displayed the lowest degree of reliability (this was, on average and to a lesser extent, also found for the second dataset). In fact, for reliable assessment of individual differences there must be variance in researcher performance capacity estimates. In addition, stronger overdispersion in the CMPCM implies lower levels of reliability of the latent variable (given that everything else in the model is held constant; Forthmann & Doebler, 2021; Forthmann et al., 2020a). For example, Model 2 still displayed excellent reliability, and it had a much larger variance estimate of researcher performance capacity and less overdispersed indicators (on average) than Models 3 and 4. In Models 3 and 4, however, we found the lowest variance estimates of researcher performance capacity and the highest average levels of overdispersion. Model 4 had somewhat better reliability because of a higher variance estimate (this was also found for the second dataset), while dispersion levels seemed to be more comparable between the models. Model 5 (the best fitting model) incorporated item-specific effects of productivity as a person-covariate and had less pronounced overdispersion than Model 3. Consequently, the reliability of researcher performance capacity estimates was higher for a model with item-specific effects of productivity than for a model with a general effect of productivity. The findings for the second dataset were generally much more homogeneous in all of the aspects discussed in relation to the first dataset.

Thus, despite the conceptual attractiveness of Model 4 as a model that provides researcher performance capacity estimates that can be understood as latent person contributions to bibliometric indicator ratios (a logic that underlies the ubiquitous journal impact factor), we had to conclude that Model 5 provided the best fit for both datasets. It should be noted that, for the first dataset, log-productivity was entered into Model 5, whereas for the second dataset, productivity was entered without transformation. So, the general finding across datasets was the need for indicator-specific coefficients for productivity (transformed or not). For the first dataset it was evident that Model 5 yielded a level of reliability that was still below acceptable standards for either research purposes or high-stakes assessment situations (Ferrando & Lorenzo-Seva, 2018). Yet, reliability for Model 3 was much worse. For research purposes reliability should be at least 0.64 (note that Model 4 and Model 5 came close to this cut-off), whereas for high-stakes assessment it should be at least 0.81 (Models 1 and 2 clearly surpass this level). This cut-off for high-stakes assessment, however, was clearly surpassed by all estimated models for the second dataset.

Finally, it is also noteworthy that the addition of academic age to the model did not result in a significant improvement in fit beyond that achieved by including productivity as a person-covariate in the first dataset. However, this was clearly the case for the second dataset. This was critical to investigate because in most assessment contexts with a focus on the researcher performance capacity of individual scholars, persons will have been active in a given field for varying amounts of time. More time simply allows one to conduct more research, which in turn might result in more academic writing experience. However, our results suggest that academic age was not an explanatory factor of reliable individual differences in research performance capacity beyond productivity. In the first dataset, reliability even increased slightly for Model 6 as compared to the best fitting Model 5, in which academic age was potentially only controlled indirectly via its moderate overlap with productivity. The inclusion of academic age in the second dataset resulted in an improved model fit; however, it explained only a practically negligible proportion of reliable individual differences in researcher performance capacity. Still, we argue that academic age should always be considered as a relevant variable in related future research.

Should we accept productivity as an essential part of researcher performance capacity?

One way to read the findings of the current work is to accept productivity as an integral part of the bibliometric assessment of researcher performance capacity. Such a conceptualization would not necessarily call for statistical control of productivity, and it results in excellent levels of reliability of researcher performance capacity (Forthmann & Doebler, 2021; Mutz & Daniel, 2018). This view might be further justified by the fact that without productivity (i.e., for researchers with no visible outputs at all) all other assessments would simply not be possible (Helmreich et al., 1980). Only productive researchers may have their output evaluated for other quality criteria, and individual differences in productivity can be understood as the basis of individual differences in anything else. In this vein, one might accept productivity as a manifestation of creative potential (Simonton, 1984) and, hence, as a basis for researcher performance capacity that brings reliability levels toward acceptable standards. Individual differences in productivity are likely to originate from a combination of individual factors (e.g., cognitive ability) and context (e.g., field, institutional resources). Explanatory count IRT models are natural candidates for modeling such individual differences. However, a well-suited dataset would require a relatively elaborate data collection.

Otherwise, we might conclude that reliability did not drop to zero when productivity was accounted for in the first dataset, and reliability was still excellent after controlling for productivity in all models for the second dataset. Thus, the reliability of individual differences cannot be attributed to productivity alone. Other aspects captured by the studied indicators in the first dataset, such as impact or international visibility of a scholar, clearly played their role in the overall level of researcher performance capacity. If one wished to strengthen this measurement approach, more non-redundant indicators that also measure impact and/or international visibility would be needed. This way, the reliability of researcher performance capacity estimates would be expected to increase up to acceptable levels. In a related vein, the excellent reliability estimates obtained for the second dataset, even after controlling for productivity, can be understood when considering how the indicators were constructed. The indicators NCS, NCSF, and NCSFL are partially redundant; this redundancy was intentionally introduced by the authors to weight the information captured by these indicators more strongly in a composite (cf. Ioannidis et al., 2016). This may result in reliable individual differences, even after controlling for productivity. It would be beneficial for future research to investigate the number of additional indicators that would be required to accurately assess researcher performance beyond productivity. Furthermore, it would be advantageous to determine which indicators should be considered to achieve this goal.

Importantly, the idea of controlling for productivity as a person-covariate in IRT-based research assessment seems vital only when the indicators (i.e., the items) in the IRT model are constructed based on the same set of products, as was the case for almost all indicators in the first dataset (only SHORTCIT was based on a subset of NUMPUB). Therefore, IRT-based approaches may not necessitate the incorporation of productivity control measures when indicators are derived from disjoint product sets (e.g., when productivity is conceptualized much more broadly by considering books, articles, software, patents, invited talks, and so forth). The results obtained for the second dataset provide empirical evidence that is consistent with this assertion. For instance, the indicators NC9620 and H20 consider all publications by the studied researchers from 1960 to 2020, whereas NCS, NCSF, and NCSFL concentrate on specific subsets of these publications. This may be another reason why integrating productivity into the IRT model for the second dataset did not result in a practically significant decline in reliability compared to the first dataset. Alternatively, one could use annual output (Yair & Goldstein, 2020). For example, Forthmann and Doebler (2021) compared different IRT approaches with the output of inventors over disjoint periods of time as items. While this somewhat circumvents the discussion on productivity inherent to the current work, it does not guarantee excellent reliability: they found a reliability of 0.69 for the best fitting model to estimate inventor performance capacity. We recommend that future work focus on more disjoint sets of bibliometric (and other) indicators for researcher performance capacity estimates within IRT frameworks.

Limitations and future work

Importantly, the IRT models used in this work implicitly assume that all indicators discriminate equally well between researchers in terms of researcher performance capacity. Extensions of the CMPCM (i.e., the 2PCMPM and its extensions; Beisemann, 2022; Beisemann et al., 2022) allow modeling indicator-specific discrimination. The low variance in latent abilities in some of the examined models for the first dataset suggests that if a common-across-indicators discrimination had been estimated (for a fixed latent variance), that common discrimination would have been quite low as well. Freeing discriminations to vary across indicators would allow investigating which indicators are particularly affected by such a drop in discriminatory power and whether there are any that still retain some discrimination between researchers. Fitting such a generalization of the CMPCM would also allow explicitly testing the assumption that discrimination is the same across all indicators. The same discussion applies to the negative binomial model used in this work for the second dataset. While models with varying discrimination have been tested in the literature (Mutz & Daniel, 2018), this has not yet been combined with productivity as an explanatory variable.

We attempted to fit corresponding extensions to the CMPCM models examined in this work; however, we ran into numerical trouble for the data at hand. For instance, for the model without person covariates, we observed a very large latent ability variance with the CMPCM, which translates into very large discriminations in the 2PCMPM, for which the sample size of this study would not be sufficient to provide unbiased estimates. Further, for count responses in the 1000s, numerical instabilities occurred in the Expectation-Maximization algorithm used to estimate the 2PCMPM (this was likely related to interpolation methods used in the algorithm; see Beisemann, 2022, for more technical details). In models with covariates, we would additionally have to handle continuous covariates, which are challenging for the algorithm in terms of computation time and feasibility (see Beisemann et al., 2022, for more details). Addressing these computational challenges was beyond the scope of the present work but might certainly be interesting for future research in order to allow for indicator-specific discrimination in this context. Another modeling challenge is the functional relationship of productivity as a covariate. While we compared linear, log-linear, and offset specifications here, spline models, also known as additive models (e.g., Wood, 2017), could help refine this part of the model. However, given these computational challenges, only very few IRT models incorporate splines (e.g., Brunn et al., 2022).

Furthermore, we focused on reliability as the main outcome in this work, and for the quantification of uncertainty in reliability estimates we had to rely on a rather pragmatic, yet conservative, bootstrap approach. This approach did not take into account the potential dependence between estimates of \({\overline{SE}}_{\theta }^{2}\) and \({\widehat{\sigma }}_{\theta }^{2}\), because approaches such as a non-parametric bootstrap combined with a case resampling procedure (Myszkowski & Storme, 2018; Storme et al., 2019) cannot easily be implemented for the CMPCM. The CMPCM involves the computation of an infinite sum, and refitting the models for only 1000 bootstrap samples is expected to take weeks even on a high-performance computer cluster. Consequently, we decided to report conservative inference based on a pragmatic yet reasonable approach and recommend cautious interpretation (especially when intervals were only slightly overlapping).

Furthermore, we acknowledge that research assessment based on impact-oriented bibliometric indicators, as used for illustration in this work, is currently hotly debated (Ramani et al., 2022; Schönbrodt et al., 2022). Yet, when focusing on the research assessment of individual researchers, it is most likely that scholarly papers will remain the basis of such an evaluation, although the indicators used may change (Gärtner et al., 2022; Schönbrodt et al., 2022). Given that scholarly papers remain the basis for researcher performance assessment, we argue that the proposed modeling approaches for bibliometric indicators will generalize straightforwardly to other indicators. Beyond being limited to bibliometric indicators, the current work is also somewhat limited by other characteristics of the datasets used (e.g., the sample of researchers). Hence, we call for applications of our proposed approaches to other datasets. In light of these limitations, however, we argue that the empirical findings in this work provide a strong proof of concept of how productivity may be integrated as a central variable into the conceptualization of research assessment.

Conclusion

Researcher productivity is the basis for any research evaluation. Multiple researchers have pointed to the critical role of productivity in quantity-quality models of scientific productivity (Caviggioli & Forthmann, 2022; Forthmann et al., 2020b; Prathap, 2018; Simonton, 2010), and in this work we have clearly demonstrated that productivity plays a central role in IRT-based research evaluation. We argue that, when productivity is part of the indicator set, the measured construct should be understood as a mix of productivity, impact, and reputation (or other aspects targeted by the indicators); that more or different indicators than in the current study are needed when the goal is to assess researchers' performance capacity more independently of productivity (i.e., to minimize the risk inherent in overemphasizing productivity, namely that researchers start churning out papers just to increase their publication counts; cf. Bornmann & Tekles, 2019); or that indicators should be constructed that do not rely on the same products (e.g., annual output). We hope to have paved the way for other researchers to find sophisticated, justifiable models for a variety of research assessment contexts.