When analyzing data, researchers have to make a multitude of decisions that affect the results of their research. Since there is often more than one justifiable approach to conducting data analysis, different (teams of) analysts may arrive at different results when tasked with answering the same research question. Recent evidence shows the potential extent of uncertainty in empirical research: In a study by Schweinsberg et al. (2021), 19 analysts were given the same data and research question but arrived at a broad range of results because of different analytical approaches and divergent operationalizations of key variables. The findings of Schweinsberg et al. (2021) and similar studies (Breznau et al., 2022; Huntington-Klein et al., 2021; Silberzahn et al., 2018) are highly relevant for indicator-based research like bibliometrics, because the specification of bibliometric indicators often comes with substantial degrees of freedom. The possibility that different specifications of the same bibliometric indicator might lead to different research outcomes poses a potentially serious threat to the credibility of bibliometric research. In this Letter to the Editor, we use the disruption index (DI1) as an example to illustrate the causes and consequences of the analytical flexibility of indicator-based bibliometric research, and we encourage the application of multiverse-style methods to increase the transparency as well as the robustness of bibliometric analyses.

Shortly after Funk and Owen-Smith (2017) introduced the DI1 as a measure of technological change, Wu et al. (2019) recognized its potential for the bibliometric study of transformative science. The DI1 has started a new stream of research and plays a central role in no fewer than three Nature articles (Lin et al., 2023a; Park et al., 2023; Wu et al., 2019) and numerous other publications. When we summarized the literature on the DI1 for our recently published review article (Leibel & Bornmann, 2024), we noticed that the calculation of disruption scores comes with numerous (hidden) degrees of freedom.

The DI1 is closely related to measures of betweenness centrality (Freeman, 1977; Gebhart & Funk, 2023) and uses bibliographic coupling links to quantify historical discontinuities in the citation network of a focal paper (FP) (Leydesdorff & Bornmann, 2021). Bibliographic coupling links connect publications that cite the same references. The DI1 ranges from −1 to 1 and is equivalent to the following ratio:

$${\text{DI}}_{1}=\frac{{N}_{F}-{N}_{B}}{{N}_{F}+{N}_{B}+{N}_{R}}$$

\({N}_{B}\) is the number of citing papers that contain at least one bibliographic coupling link with the FP. These papers represent historical continuity because they connect the more recent literature with literature that predates the FP. Conversely, \({N}_{F}\) quantifies historical discontinuities by counting the number of papers that cite the FP without citing any of the FP’s cited references. A large \({N}_{F}\) signals that the ideas that inspired the FP are no longer relevant for future research. Because \({N}_{B}\) is subtracted from \({N}_{F}\) in the numerator of the DI1, positive disruption scores indicate that the FP “overshadows” (Liu et al., 2023) previous research, whereas negative disruption scores indicate that previous research remains relevant after the publication of the FP. \({N}_{R}\) is the number of papers that cite the FP’s cited references without citing the FP itself. Compared to \({N}_{B}\) and \({N}_{F}\), it is less clear what \({N}_{R}\) is supposed to represent. Its purpose may be to relate the citation impact of the FP to the citation impact of its cited references: \({N}_{R}\) reduces the disruption score of an FP considerably if the citations received by the FP’s cited references outnumber the citations received by the FP itself.
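To make the roles of \({N}_{F}\), \({N}_{B}\), and \({N}_{R}\) concrete, the following minimal Python sketch computes the DI1 for a single FP. The data layout (sets of paper identifiers and a dictionary of reference lists) and all identifiers in the toy example are our own illustrative assumptions, not a prescribed implementation.

```python
def disruption_index(fp_citers, fp_refs, refs_of):
    """Minimal DI1 sketch for one focal paper (FP).

    fp_citers -- set of papers that cite the FP
    fp_refs   -- set of the FP's cited references
    refs_of   -- dict mapping every relevant paper to the set of papers it cites
    """
    # N_F: papers that cite the FP but none of its cited references
    n_f = sum(1 for p in fp_citers if not (refs_of.get(p, set()) & fp_refs))
    # N_B: papers that cite the FP and at least one of its cited references
    n_b = len(fp_citers) - n_f
    # N_R: papers that cite at least one of the FP's references but not the FP
    n_r = sum(1 for p, refs in refs_of.items()
              if p not in fp_citers and refs & fp_refs)
    denom = n_f + n_b + n_r
    return (n_f - n_b) / denom if denom else float("nan")


# Toy example (all identifiers hypothetical): "A" counts towards N_B,
# "B" towards N_F, and "C" towards N_R, so DI1 = (1 - 1) / (1 + 1 + 1) = 0.
refs_of = {"A": {"FP", "R1"}, "B": {"FP"}, "C": {"R2"}}
print(disruption_index(fp_citers={"A", "B"}, fp_refs={"R1", "R2"}, refs_of=refs_of))
```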

The calculation of the DI1 hints at degrees of freedom that are often neglected in the literature. Since the justifiable modifications of the DI1 are too numerous to discuss in full in this Letter to the Editor (there is an ever-growing multiverse of modifications), we refer the interested reader to our literature review (Leibel & Bornmann, 2024), where we provide an overview of the various alternatives to the DI1 that researchers have suggested so far. For the sake of brevity, we limit the illustration of the analytical flexibility of the DI1 to three examples. First, Bornmann et al. (2020) point out that the DI1 contains an implicit threshold \(X\), such that a citing paper only counts towards \({N}_{B}\) if it cites at least \(X\) of the FP’s cited references. For the DI1, \(X\) = 1, but one could just as well choose \(X\) > 1 with the argument that stronger bibliographic coupling links are a better indicator of historical continuity. Second, when calculating disruption scores, it is common practice to use citation windows of \(Y\) years. Third, the exclusion of FPs with fewer than \(Z\) cited references (and/or citations) may help to avoid data artefacts. Table 1 explains the significance of \(X\), \(Y\), and \(Z\) in detail and lists examples from the current literature; the code sketch following Table 1 illustrates how the three parameters enter the calculation.

Table 1 Examples of degrees of freedom in the calculation of the DI1
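To illustrate how \(X\), \(Y\), and \(Z\) enter the calculation, the sketch below extends the function above with the three parameters. The data layout (a dictionary mapping each citing paper to its publication year and reference set, with "FP" as a placeholder identifier for the focal paper) is again a hypothetical choice of ours; note also that citing papers with between 1 and \(X-1\) coupling links count towards neither \({N}_{B}\) nor \({N}_{F}\) in this sketch, which is only one possible reading of the threshold.

```python
def disruption_index_xyz(fp_year, fp_refs, papers, X=1, Y=None, Z=1):
    """Parameterized DI1 sketch.

    fp_year -- publication year of the focal paper (FP)
    fp_refs -- set of the FP's cited references
    papers  -- dict: paper id -> (publication year, set of cited papers),
               covering every paper that cites the FP or one of its references;
               the FP itself is referred to by the placeholder id "FP"
    X -- minimum number of bibliographic coupling links required for N_B
    Y -- citation window in years after fp_year (None = no window)
    Z -- minimum number of cited references the FP must have
    """
    if len(fp_refs) < Z:
        return None  # FP excluded to avoid data artefacts

    n_f = n_b = n_r = 0
    for year, cited in papers.values():
        if Y is not None and year > fp_year + Y:
            continue                      # outside the citation window
        coupling = len(cited & fp_refs)   # bibliographic coupling links with the FP
        if "FP" in cited:
            if coupling >= X:
                n_b += 1                  # historical continuity
            elif coupling == 0:
                n_f += 1                  # historical discontinuity
            # 1 <= coupling < X: counts towards neither term in this sketch
        elif coupling > 0:
            n_r += 1                      # cites the FP's references, not the FP
    denom = n_f + n_b + n_r
    return (n_f - n_b) / denom if denom else None
```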

If all combinations of \(X\), \(Y\), and \(Z\) are justifiable, one would expect all of them to lead to results that support the same conclusion. However, as long as the set of possible results is not calculated and reported in its entirety, there is no way of knowing whether the results are consistent. In the case of the DI1 and its variants, empirical evidence shows that the required strength of the bibliographic coupling links for \({N}_{B}\) (Bittmann et al., 2022; Bornmann & Tekles, 2021; Bornmann et al., 2020; Deng & Zeng, 2023; Wang et al., 2023), the length of the citation window (Bornmann & Tekles, 2019a; Liang et al., 2022), and the treatment of data artefacts (Holst et al., 2024; Liang et al., 2022; Ruan et al., 2021) can each substantially affect research outcomes.

For the purpose of illustration, we present the average disruption scores for a set of 77 Nobel Prize winning papers (NPs) published between 1985 and 2000. We collected the NPs from the dataset provided by Li et al. (2019) and used the Max Planck Society’s in-house version of the Web of Science (Clarivate) to calculate the disruption scores. We considered five choices for \(X\) (1, 2, 3, 4, 5), three choices for \(Y\) (3 years, 5 years, 10 years), and three choices for \(Z\) (1 reference, 5 references, 10 references). Because each combination of \(X\), \(Y\), and \(Z\) leads to a unique outcome, there is a total of \(5\times 3\times 3=45\) different results. Figure 1 shows both the range and the density distribution of the 45 results. The majority of the average disruption scores (each averaged across the 77 NPs) are greater than zero, but some are negative. Even though the example is limited to \(X\), \(Y\), and \(Z\), and does not consider additional factors that may affect disruption scores, it still unveils a broad range of possible results: The average disruption scores of the NPs range from −0.034 (\(X=1\), \(Y=3\), \(Z=1\)) to 0.123 (\(X=5\), \(Y=10\), \(Z=1\)).

Fig. 1 Kernel density distribution of the average disruption scores of 77 Nobel Prize winning papers published between 1985 and 2000, depending on the minimum number of bibliographic coupling links (\(X\)), the length of the citation window (\(Y\)), and the minimum number of cited references (\(Z\))
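A multiverse of indicator specifications like the one underlying Fig. 1 can be enumerated with a few lines of code. The sketch below reuses the hypothetical disruption_index_xyz function from above on a single toy FP (all identifiers invented for illustration) and reports the range of DI1 scores across the 45 combinations of \(X\), \(Y\), and \(Z\); with real citation data for the 77 NPs, the same loop would produce the 45 averages summarized in Fig. 1.

```python
import itertools

# Toy focal paper published in 1990 with ten cited references (ids hypothetical).
fp_year = 1990
fp_refs = {f"R{i}" for i in range(1, 11)}
papers = {
    "A": (1992, {"FP", "R1", "R2"}),  # cites the FP and two of its references
    "B": (1993, {"FP"}),              # cites only the FP
    "C": (1998, {"FP"}),              # cites only the FP, outside short windows
    "D": (1991, {"R1"}),              # cites one reference but not the FP
}

X_values, Y_values, Z_values = [1, 2, 3, 4, 5], [3, 5, 10], [1, 5, 10]
multiverse = {
    (X, Y, Z): disruption_index_xyz(fp_year, fp_refs, papers, X=X, Y=Y, Z=Z)
    for X, Y, Z in itertools.product(X_values, Y_values, Z_values)
}
scores = [s for s in multiverse.values() if s is not None]
print(f"{len(multiverse)} specifications: DI1 ranges "
      f"from {min(scores):.3f} to {max(scores):.3f}")
```

Even this toy example yields DI1 scores ranging from 0 to about 0.67, depending solely on the choice of \(X\) and \(Y\) (in this toy, \(Z\) never excludes the FP because it has ten cited references).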

In light of the risk that research outcomes may vary greatly across unique combinations of \(X\), \(Y\), and \(Z\) (as well as other factors), it seems problematic that, in standard research practice, researchers typically present analyses and results based on just one or a few specifications of the DI1. Empirical results and their policy implications could hinge on arbitrarily chosen specifications of bibliometric indicators that are no more or less defensible than alternative specifications. If this uncertainty cannot be eliminated, it should at least be acknowledged. In empirical social research, it is standard practice to communicate and quantify the risk that samples may be unrepresentative in the form of standard errors, confidence intervals, and \(p\) values. Similarly, in bibliometrics, the credibility of research would profit from acknowledging that a result achieved with a specific variant of a bibliometric indicator may not be representative of the entire range of results that can be achieved with alternative indicator specifications.

We now turn to the methodological consequences of our observations. The widespread convention of presenting a main analysis and (maybe) a few robustness checks means that the reader of a bibliometric study gets to see results based on only a very limited number of indicator specifications. Sometimes, researchers may have good reasons for their selection of indicators. However, for want of convincing theoretical or statistical reasons to prefer any particular variant of an indicator to alternative specifications, analysts find themselves faced with a large set of justifiable indicators. In such a scenario, the important question is not “Which is the best indicator?” but rather “Which set of indicators deserves consideration?” (Young, 2018). This set—the multiverse of equally valid indicator specifications—“directly implies a multiverse of statistical results” (Steegen et al., 2016, p. 702), which should be reported in its entirety. This can be achieved with contemporary computational power, as was demonstrated by Muñoz and Young (2018). The authors show that widely used statistics software (like Stata) can both run and visualize several billion regressions.

Researchers from different disciplines have developed several multiverse-style methods like multiverse analysis (Steegen et al., 2016), multimodel analysis (Young & Holsteen, 2017), specification-curve analysis (Simonsohn et al., 2020), and vibration of effects analysis (Patel et al., 2015). All these methods build on the same core principle: Empirical results are not trustworthy if justifiable changes to the research strategy drastically alter the conclusions of a study. Multiverse-style methods can be thought of as very extensive and systematic robustness checks. Both conventional robustness checks and multiverse-style methods are guided by the notion that “a fragile inference is not worth taking seriously” (Leamer, 1985, p. 308). Only multiverse-style methods take this notion to its logical conclusion by transparently communicating the entire multiverse of statistical results. In the case of bibliometric indicators, this means that, ideally, all variants of an indicator that are supposed to capture the same concept (equally well) should lead to results that support the same conclusion.
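In this spirit, a minimal consistency check over a multiverse of results could look as follows. The sketch assumes a dictionary like the multiverse built above, mapping each specification (e.g., an \((X, Y, Z)\) tuple) to an estimate such as an average disruption score, and reports whether every specification points in the same direction.

```python
def summarize_multiverse(results):
    """Summarize a multiverse of results (sketch).

    results -- dict mapping each specification, e.g. an (X, Y, Z) tuple,
               to an estimate such as an average disruption score
    """
    values = [v for v in results.values() if v is not None]
    share_positive = sum(v > 0 for v in values) / len(values)
    return {
        "n_specifications": len(values),
        "min": min(values),
        "max": max(values),
        # 1.0 (or 0.0) means that every specification supports the same
        # conclusion about the sign of the effect
        "share_positive": share_positive,
    }


print(summarize_multiverse(multiverse))  # reusing the multiverse dict from above
```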

We limit the focus of this Letter to the Editor to the multiverse of bibliometric indicators, but other aspects of the (bibliometric) research process like data collection (Harder, 2020), data processing (Steegen et al., 2016), and the specification of statistical models each come with their own multiverse. In empirical research, results are not only driven by raw data, but also by the decisions of the researchers who collect and analyze the data. The resulting uncertainty of research outcomes is, in the words of Chatfield (1995, p. 419), simply “a fact of life”. It is not a unique feature of bibliometric indicators. We believe that multiverse-style methods may be of particular relevance for bibliometrics because bibliometric indicators, as exemplified above for the DI1, tend to have numerous variants. These variants create large multiverses that often remain unreported, e.g., due to the limitations of conventional robustness checks.

In this Letter to the Editor, we used the DI1 as an example to illustrate the degrees of freedom that come with the specification of a bibliometric indicator. In a current research project, we are working on an empirical multiverse-style analysis to investigate the robustness of DI1 scores. We believe that similar lines of argument may be applied to other bibliometric indicators, such as interdisciplinarity indicators. Wang and Schneider (2020, p. 239) analyzed the consistency of interdisciplinarity measures and found “surprisingly deviant results when comparing measures that supposedly should capture similar features or dimensions of the concept of interdisciplinarity”. Multiverse-style methods could be used to quantify the uncertainty of results in bibliometric interdisciplinarity research, as well as in other streams of research, in order to find and eliminate sources of uncertainty. By unveiling the decision nodes required to calculate bibliometric indicators, multiverse-style methods could pave the way towards more robust indicator-based research.