The Sci-Hub effect on papers’ citations

Scientometrics

Abstract

Citations are often used as a metric of the impact of scientific publications. Here, we examine how the number of downloads from Sci-Hub, as well as various characteristics of publications and their authors, predict future citations. Using data from 12 leading journals in economics, consumer research, neuroscience, and multidisciplinary research, we found that articles downloaded from Sci-Hub were cited 1.72 times more than papers not downloaded from Sci-Hub and that the number of downloads from Sci-Hub was a robust predictor of future citations. Among other characteristics of publications, the number of figures in a manuscript consistently predicts its future citations. The results suggest that limited access to publications may limit some scientific research from achieving its full impact.

Data Availability

Our data sets and the code that we developed for the analyses are available in the following public repository: https://osf.io/8c632/?view_only=19ea965dd02449a0927a3d95d0132a55

References

  • Adler, R., Ewing, J., & Taylor, P. (2009). Citation statistics: A report from the International Mathematical Union (IMU) in cooperation with the International Council of Industrial and Applied Mathematics (ICIAM) and the Institute of Mathematical Statistics (IMS). Statistical Science, 24(1), 1–14.

  • Andročec, D. (2017). Analysis of Sci-Hub downloads of computer science papers. Acta Universitatis Sapientiae Informatica, 9(1), 83–96. https://doi.org/10.1515/ausi-2017-0006.

  • Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21.

  • Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086–1120. https://doi.org/10.1016/j.leaqua.2010.10.010.

  • Armstrong, M. (2015). Opening access to research. Economic Journal, 125(586), F1–F30.

  • Bendezú-Quispe, G., Nieto-Gutiérrez, W., Pacheco-Mendoza, J., & Taype-Rondan, A. (2016). Sci-Hub and medical practice: An ethical dilemma in Peru. The Lancet Global Health, 4(9), e608.

  • Berg, J., Bhalla, N., Bourne, P., Chalfie, M., Drubin, D., Fraser, J., et al. (2016). Preprints for the life sciences. Science, 352(6288), 899–901.

  • Björk, B. C., & Solomon, D. (2012). Open access versus subscription journals: A comparison of scientific impact. BMC Medicine, 10, 73. https://doi.org/10.1186/1741-7015-10-73.

  • Bohannon, J. (2016). Who’s downloading pirated papers? Everyone. Science, 352(6285), 508–512.

  • Bohannon, J., & Elbakyan, A. (2016). Data from: Who’s downloading pirated papers? Everyone. Dryad Digital Repository. https://doi.org/10.5061/dryad.q447c.

  • Boudry, C., Alvarez-Muñoz, P., Arencibia-Jorge, R., Ayena, D., Brouwer, N. J., Chaudhuri, Z., et al. (2019). Worldwide inequality in access to full text scientific articles: The example of ophthalmology. PeerJ, 7, e7850.

  • Boukacem-Zeghmouri, C., Bador, P., Lafouge, T., & Prost, H. (2016). Relationships between consumption, publication and impact in French universities in a value perspective: A bibliometric analysis. Scientometrics, 106(1), 263–280.

  • Breitzman, A., & Thomas, P. (2015). Inventor team size as a predictor of the future citation impact of patents. Scientometrics, 103(2), 631–647.

  • Brody, T., Harnad, S., & Carr, L. (2006). Earlier web usage statistics as predictors of later citation impact. Journal of the American Society for Information Science and Technology, 57(8), 1060–1072.

  • Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3), 404–426. https://doi.org/10.1214/19-STS721.

  • Chen, X. (2016). A middle-of-the-road proposal amid the Sci-Hub controversy: Share “unofficial” copies of articles without embargo, legally. Publications, 4(4), 29. https://doi.org/10.3390/publications4040029.

  • Deshpande, P. R. (2019). Why should Sci-Hub be supported? International Journal of Health and Allied Sciences, 8(3), 210–212. https://doi.org/10.4103/ijhas.IJHAS_91_18.

  • Faust, J. S. (2016). Sci-Hub: A solution to the problem of paywalls, or merely a diagnosis of a broken system? Annals of Emergency Medicine, 68(1), 15A–17A. https://doi.org/10.1016/j.annemergmed.2016.05.010.

  • Garcia-Puente, M., Pastor-Ramon, E., Agirre, O., Moran, J. M., & Herrera-Peco, I. (2019). The use of Sci-Hub in systematic reviews of the scholarly literature. Clinical Implant Dentistry and Related Research, 21(5), 816. https://doi.org/10.1111/cid.12815.

  • Gonzalez-Solar, L., & Fernandez-Marcial, V. (2019). Sci-Hub, a challenge for academic and research libraries. Profesional de la Información, 28(1). https://doi.org/10.3145/epi.2019.ene.12.

  • Greco, A. N. (2017). The Kirtsaeng and SCI-HUB cases: The major US copyright cases in the twenty-first century. Publishing Research Quarterly, 33(3), 238–253. https://doi.org/10.1007/s12109-017-9522-7.

  • Hausmann, R., Hidalgo, C., Bustos, S., Coscia, M., Simoes, A., & Yildrim, M. (2013). The atlas of economic complexity: Mapping paths to prosperity. Cambridge: MIT Press.

  • Hegarty, P., & Walton, Z. (2012). The consequences of predicting scientific impact in psychology using journal impact factors. Perspectives on Psychological Science, 7(1), 72–78.

  • Himmelstein, D. S., Romero, A. R., Levernier, J. G., Munro, T. A., McLaughlin, S. R., Tzovaras, B. G., et al. (2018). Sci-Hub provides access to nearly all scholarly literature. eLife, 7, e32822.

  • Horowitz, I. (1986). Scientific access and political constraint to knowledge: Revisiting the dilemma of rights and obligations. Science Communication, 7(4), 397–405. https://doi.org/10.1177/107554708600700404.

  • Jaffe, K., Caicedo, M., Manzanares, M., Gil, M., Rios, A., Florez, A., et al. (2013). Productivity in physical and chemical science predicts the future economic growth of developing countries better than other popular indices. PLoS ONE, 8(6), e66239.

  • Jaffé, R. (2019). #Pay4Reviews: Academic publishers should pay scientists for peer-review. PeerJ Preprints, 7, e27573v1.

  • Laverde-Rojas, H., & Correa, J. C. (2019). Can scientific productivity impact the economic complexity of countries? Scientometrics, 120(1), 267–282.

  • Lee, H. A., Law, R., & Ladkin, A. (2014). What makes an article citable? Current Issues in Tourism, 17(5), 455–462.

  • Lewbel, A. (2012). Using heteroscedasticity to identify and estimate mismeasured and endogenous regressor models. Journal of Business & Economic Statistics, 30(1), 67–80.

  • Machin-Mastromatteo, J. D., Uribe-Tirado, A., & Romero-Ortiz, M. E. (2016). Piracy of scientific papers in Latin America: An analysis of Sci-Hub usage data. Information Development, 32(5), 1806–1814. https://doi.org/10.1177/0266666916671080.

  • Manley, S. (2019). On the limitations of recent lawsuits against Sci-Hub, OMICS, ResearchGate, and Georgia State University. Learned Publishing, 32(4), 375–381. https://doi.org/10.1002/leap.1254.

  • McNutt, M. (2016). My love-hate of Sci-Hub. Science, 352(6285), 497. https://doi.org/10.1126/science.aaf9419.

  • Mejia, C. R., Valladares-Garrido, M. J., Miñan-Tapia, A., Serrano, F. T., Tobler-Gómez, L. E., Pereda-Castro, W., et al. (2017). Use, knowledge, and perception of the scientific contribution of Sci-Hub in medical students: Study in six countries in Latin America. PLoS ONE, 12(10), e0185673.

  • Milkman, K. L., & Berger, J. (2014). The science of sharing and the sharing of science. Proceedings of the National Academy of Sciences, 111(Supplement 4), 13642–13649.

  • Nazarovets, S. A. (2018). Black open access in Ukraine: Analysis of downloading Sci-Hub publications by Ukrainian internet users. Science and Innovation, 14(2), 19–24. https://doi.org/10.15407/scine14.02.019.

  • Nicholas, D., Boukacem-Zeghmouri, C., Xu, J., Herman, E., Clark, D., Abrizah, A., et al. (2019). Sci-Hub: The new and ultimate disruptor? View from the front. Learned Publishing, 32(2), 147–153.

  • Novo, L. A. B., & Onishi, V. C. (2017). Could Sci-Hub become a quicksand for authors? Information Development, 33(3), 324–325. https://doi.org/10.1177/0266666917703638.

  • O’Loughlin, J., & Sidaway, J. D. (2020). Commercial publishers: What is to be done? Geoforum, 112, 6–8. https://doi.org/10.1016/j.geoforum.2019.12.011.

  • Paulus, F. M., Rademacher, L., Schäfer, T. A. J., Müller-Pinzler, L., & Krach, S. (2015). Journal impact factor shapes scientists’ reward signal in the prospect of publication. PLoS ONE, 10(11), e0142537.

  • Peet, L. (2016). Sci-Hub sparks critique of librarian. Library Journal, 141(15), 14–17.

  • Pinto, T., & Teixeira, A. A. C. (2020). The impact of research output on economic growth by fields of science: A dynamic panel data analysis, 1980–2016. Scientometrics. https://doi.org/10.1007/s11192-020-03419-3.

  • Radicchi, F., Fortunato, S., & Castellano, C. (2008). Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences of the United States of America, 105(45), 17268–17272. https://doi.org/10.1073/pnas.0806977105.

  • Saleem, F., Hasaali, M. A., & Haq, Nu. (2017). Sci-Hub & ethical issues. Research in Social & Administrative Pharmacy, 13(1), 253. https://doi.org/10.1016/j.sapharm.2016.09.001.

  • Seguin, J. (2019). The future of access: How a mosaic of next-gen solutions will deliver more convenient access to more users. Information Services & Use, 39(3), 237–242. https://doi.org/10.3233/ISU-190049.

  • Sekara, V., Deville, P., Ahnert, S. E., Barabási, A. L., Sinatra, R., & Lehmann, S. (2018). The chaperone effect in scientific publishing. Proceedings of the National Academy of Sciences, 115(50), 12603–12607.

  • Shuai, X., Pepe, A., & Bollen, J. (2012). How the scientific community reacts to newly submitted preprints: Article downloads, Twitter mentions, and citations. PLoS ONE, 7(11), e47523.

  • Sinatra, R., Wang, D., Deville, P., Song, C., & Barabási, A. L. (2016). Quantifying the evolution of individual scientific impact. Science, 354(6312). https://doi.org/10.1126/science.aaf5239.

  • Smith, L. D., Best, L. A., Stubbs, D. A., Archibald, A. B., & Roberson-Nay, R. (2002). Constructing knowledge: The role of graphs and tables in hard and soft psychology. American Psychologist, 57(10), 749.

  • Solomon, D. J. (2014). A survey of authors publishing in four megajournals. PeerJ, 2014(1), e365.

  • Solomon, D. J., & Björk, B. C. (2012). Publication fees in open access publishing: Sources of funding and factors influencing choice of journal. Journal of the American Society for Information Science and Technology, 63(1), 98–107.

  • Stasinopoulos, M., Rigby, R. A., Heller, G. Z., Voudouris, V., & De Bastiana, F. (2017). Flexible regression and smoothing using GAMLSS in R. Boca Raton, USA: CRC Press.

  • Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.

  • Strielkowski, W. (2017). Will the rise of Sci-Hub pave the road for the subscription-based access to publishing databases? Information Development, 33(5), 540–542.

  • Sá, M. J., Ferreira, C. M., & Serpa, S. (2019). Science communication and online social networks: Challenges and opportunities. Knowledge Management: An International Journal, 19(2), 1–22.

  • Till, B. M., Rudolfson, N., Saluja, S., Gnanaraj, J., Samad, L., Ljungman, D., et al. (2019). Who is pirating medical literature? A bibliometric review of 28 million Sci-Hub downloads. Lancet Global Health, 7(1), E30–E31. https://doi.org/10.1016/S2214-109X(18)30388-7.

  • Varki, A. (2017). Scientific journals: Rename the impact factor. Nature, 548(7668), 393.

  • Zhang, Z., & Van Poucke, S. (2017). Citations for randomized controlled trials in sepsis literature: The halo effect caused by journal impact factor. PLoS ONE, 12(1), e0169398.

  • Zhu, J., & Liu, W. (2020). A tale of two databases: The use of Web of Science and Scopus in academic papers. Scientometrics. https://doi.org/10.1007/s11192-020-03387-8.

Author information

Corresponding author

Correspondence to Juan C. Correa.

Appendix

This appendix provides detailed guidance for both understanding and reproducing the results of “The Sci-Hub Effect: Sci-Hub downloads lead to more article citation.” A first analysis is presented in the “Part 1” section, which is composed of six subsections. In each of these subsections we provide the arguments that allow readers to understand how we reach more accurate estimates (i.e., robust estimates). The “Part 2” section presents a second analysis with a similar structure and purpose. The difference between the first and the second analysis lies in the statistical techniques employed. The techniques employed in both analyses follow the recommendations of multiverse analysis (Steegen et al. 2016), and allowed us to discard other confounding factors that could lead to a misleading interpretation of the results.

Part 1

We cleaned the data set by omitting observations with missing information, which reduced the sample to 8131 observations. We used these remaining observations to examine the behavior of outliers with boxplots and descriptive statistics. Both Fig. 3 and Table 1 reveal the presence of extreme values, particularly for citations, the number of pages, and the number of figures and tables. The presence of these outliers leads us to evaluate their influence on the regression models that are presented in the following subsections.
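The boxplot screening described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the analyses themselves were run in R), it assumes the conventional 1.5 × IQR whisker rule, and the function name `iqr_outliers` and the citation counts are hypothetical rather than taken from the data set.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the boxplot whiskers (k * IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical citation counts: mostly modest, one extreme value.
citations = [3, 5, 8, 12, 15, 20, 22, 30, 41, 9000]
print(iqr_outliers(citations))
```

A rule of this kind only flags candidate outliers; whether they distort the regression is what the diagnostics in the following subsections examine.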

Fig. 3 Boxplot diagnostics

Table 1 Descriptive Statistics

Regression diagnostics

An additional scatterplot analysis reveals a positive relationship between the number of citations and the number of Sci-Hub downloads (see Fig. 4). This relationship, however, can be distorted by the outliers already mentioned.

Fig. 4 Scatterplot diagnostic

Table 2 summarizes the results of a multiple regression. There is a positive and significant relationship between the number of Sci-Hub downloads and the number of citations. The number of citations is also positively associated with the number of figures, the number of authors per article, the impact factor, and the H-index of both the first and last author of each paper. The length of the title has a negative relationship with the number of citations (i.e., papers with lengthy titles tend to have fewer citations), but this association proved to be non-significant. The chaperone effect, the number of pages, the total number of tables, and the country resources, as captured by the GDP per capita and the Nature Index, did not significantly predict the citations of a paper (at a statistical significance level of 5%).

Table 2 OLS Model

Based on the results of Table 2, we can run some regression diagnostics (e.g., checking the assumptions of the model and the effects of the outliers on the results). These diagnostics allow us to decide which parameter estimation method best suits the data. We begin with a residual analysis to test the assumption \(E(\epsilon | X)=0\). To do this, we plot the residuals against the fitted values in Panel (a) of Fig. 5. Individual points in this graph must be interpreted by their distance from zero (i.e., the larger the distance from zero, the worse the estimate). We observe that some values can substantially alter the results of the regressions. Our second analysis tests the normality of the residuals through a Q-Q plot, as depicted in Panel (b) of Fig. 5. In both tails, several points do not fit the line, which invalidates the inference drawn from the regressions (in particular, the confidence intervals and the significance tests). In Panel (c) of Fig. 5 we evaluate the i.i.d. assumption, particularly that of homoscedasticity. Points lying along the red line would indicate that the residuals have uniform variance; the outlying points depart from this pattern, implying problems of heteroscedasticity. Finally, Cook’s distance in Panel (d) of Fig. 5 shows that some points lie far from the rest, signalling highly influential observations.

Fig. 5 Plot diagnostics: (a) Residuals vs Fitted values; (b) Normal Q-Q; (c) Scale-Location; (d) Cook’s distance

In Table 3, we used a deletion diagnostic to identify which influential observations may cause a substantial change in the fit when they are excluded from the model. We used the following measures of influence when the ith observation is deleted: (a) DFFIT (how much the regression function changes), (b) DFBETA (how much the coefficients change), (c) COVRATIO (how much the covariance matrix changes), (d) \(D^2\) (Cook’s distance, how much the entire regression function changes), and (e) hat-values (for detecting high-leverage observations). In the literature, an observation is commonly considered unusual if it is detected by at least one of the aforementioned influence measures. Although many observations meet this condition, we only show a few for the sake of brevity. We noticed that observations such as 1952 or 2223 stand out under any measure of influence, demonstrating how detrimental these points can be for the results of the regression analysis.
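Two of the influence measures listed above, hat-values and Cook’s distance, can be computed in closed form for a regression with a single predictor. The sketch below is an illustration only: it is written in Python rather than the R used for the analyses, the data are invented, the name `influence_measures` is hypothetical, and the 4/n cutoff is a common rule of thumb that we assume here, not a threshold stated in this appendix.

```python
def influence_measures(x, y):
    """Hat-values and Cook's distances for the simple regression y = a + b*x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance
    # Leverage of each observation (diagonal of the hat matrix).
    hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Cook's distance combines residual size and leverage.
    cooks = [e ** 2 * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, hat)]
    return hat, cooks

# Hypothetical data: downloads (x) and citations (y); the last paper is extreme.
downloads = [1, 2, 3, 4, 5, 6, 7, 30]
cites = [2, 4, 5, 8, 11, 12, 15, 20]
hat, cooks = influence_measures(downloads, cites)
flagged = [i for i, d in enumerate(cooks) if d > 4 / len(cooks)]  # assumed cutoff
print(flagged)
```

In this toy example the extreme paper has a small residual but a very high leverage, which is why it dominates the fit.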

Table 3 Analysis of the influence of outliers

Dealing with outliers

So far, we found outliers that threaten the validity of traditional regression analyses. One possible solution, given the regression diagnostics, would consist of eliminating the problematic observations. However, with this technique, valuable information is lost. Instead, we use a robust regression that is less sensitive to outliers. Table 4 shows the results for the model presented in Table 2 estimated by robust regression, through the use of iteratively reweighted least squares. Robust regression assigns a higher weight to observations that generate a lower residual. Comparing the results of the OLS and robust regressions shows that coefficients, signs, and statistical significance are very different, revealing a strong influence of outliers on the model parameters in the OLS regression.
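The reweighting idea behind robust regression can be sketched as follows. This is an illustration under stated assumptions, not the estimator behind Table 4: it is written in Python rather than R, it assumes Huber weights with the conventional tuning constant 1.345 and a median-based residual scale, it handles a single predictor, and the names `wls_line` and `huber_irls` and the data are hypothetical.

```python
import statistics

def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def huber_irls(x, y, c=1.345, iterations=50):
    """Huber-type robust line fit via iteratively reweighted least squares."""
    w = [1.0] * len(x)  # the first pass is ordinary least squares
    for _ in range(iterations):
        a, b = wls_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust residual scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median(abs(e) for e in resid) / 0.6745 or 1e-12
        # Weight 1 for small residuals, shrinking as the residual grows.
        w = [min(1.0, c * scale / abs(e)) if e != 0 else 1.0 for e in resid]
    return a, b

# Hypothetical data on the line y = 2x + 1, with the last observation contaminated.
x = list(range(1, 11))
y = [2 * xi + 1 for xi in x]
y[-1] = 100
ols_a, ols_b = wls_line(x, y, [1.0] * len(x))
rob_a, rob_b = huber_irls(x, y)
print(f"OLS slope: {ols_b:.2f}, robust slope: {rob_b:.2f} (true slope: 2)")
```

The single contaminated observation pulls the OLS slope well away from 2, while the reweighted fit is far less affected, which mirrors the contrast between Tables 2 and 4.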

Table 4 Robust Regression

Dealing with heteroscedasticity

An important assumption in traditional regression models is that errors are homoscedastic. When this assumption is violated, the usual covariance matrix estimator is inconsistent, so standard errors and significance tests become unreliable. Although a first exploration was already carried out through graphical analysis, we test this assumption in our model through the Breusch-Pagan test. As expected in cross-sectional models, the test shows the presence of heteroscedasticity (\(BP = 283.76\), \(df = 14\), \(p < 2.2 \times 10^{-16}\)). The solution consists of employing heteroscedasticity-consistent estimators, here the basic Huber-White sandwich estimator. Table 5 shows the results for regression with robust standard errors.

Table 5 Regression with Robust Standard Errors

With this correction, the results are similar to the results of OLS with regard to the sign, magnitude, and statistical significance of the coefficients, but different from those of a robust regression in the size of the coefficients.
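To make the test and the correction concrete, the sketch below computes a Breusch-Pagan statistic and the basic (HC0) sandwich standard error for a regression with one predictor. It is only an illustration: it is in Python rather than R, it assumes the studentized (Koenker) form of the statistic, i.e. n times the R² of the squared residuals regressed on the predictor, the data are invented so that the error spread grows with x, and all function names are hypothetical.

```python
import math

def ols_line(x, y):
    """OLS fit of y = a + b*x; returns coefficients, residuals, Sxx and mean of x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid, sxx, x_bar

def breusch_pagan_lm(x, resid):
    """Studentized Breusch-Pagan statistic: n * R^2 of e^2 regressed on x."""
    e2 = [e ** 2 for e in resid]
    _, _, aux_resid, _, _ = ols_line(x, e2)  # auxiliary regression
    mean_e2 = sum(e2) / len(e2)
    ss_tot = sum((v - mean_e2) ** 2 for v in e2)
    ss_res = sum(r ** 2 for r in aux_resid)
    return len(x) * (1 - ss_res / ss_tot)

def slope_standard_errors(x, resid, sxx, x_bar):
    """Classical and heteroscedasticity-consistent (HC0) standard errors of the slope."""
    n = len(x)
    classical = math.sqrt(sum(e ** 2 for e in resid) / (n - 2) / sxx)
    hc0 = math.sqrt(sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, resid))) / sxx
    return classical, hc0

# Hypothetical data: the size of the disturbance grows with x.
x = list(range(1, 21))
y = [2 * xi + (-1) ** xi * 0.5 * xi for xi in x]
a, b, resid, sxx, x_bar = ols_line(x, y)
lm = breusch_pagan_lm(x, resid)
classical, hc0 = slope_standard_errors(x, resid, sxx, x_bar)
print(f"LM = {lm:.2f} (5% critical value with 1 df: 3.84)")
print(f"classical SE = {classical:.3f}, HC0 SE = {hc0:.3f}")
```

The sandwich correction leaves the coefficient estimates untouched and changes only their standard errors.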

Dealing with endogeneity

Another assumption in OLS models is exogeneity. Its violation, endogeneity, takes place when one of the independent variables is correlated with the residual term in the regression equation. In that case, the OLS estimates can be spurious. The traditional technique to correct this problem is the use of instrumental variables. However, this method requires external instruments that are not always available. Here, we rely on Lewbel’s methodology to evaluate the endogeneity problem (Lewbel 2012). Although our results are based on R packages, the application of Lewbel’s methodology is best developed in Stata, particularly the tests of overidentifying restrictions. Sargan’s statistic is not robust in the presence of conditional heteroscedasticity, so we rely on the Hansen J statistic. Table 6 shows the results obtained through Lewbel’s method, assuming that the Sci-Hub and Nature Index variables are endogenous. The Hansen test allows us to test the orthogonality conditions for the instruments. The results indicate that the model may have an endogeneity problem, so it needs to be instrumented in different models. Taken together, the tests of the OLS assumptions lead us to conclude that our proposed regression model is affected by outliers, heteroscedasticity, and endogeneity problems. To overcome these problems, our results will be presented using robust regression, regression with robust standard errors, and instrumental variables based on heteroscedasticity.

Table 6 Regression with instrumented variables based on Heteroscedasticity

The results of Table 6 show that the majority of variables proved to be significant predictors of citations, except the title length, the chaperone degree, the total number of tables, and the resources of the affiliated country, as captured by the GDP per capita and the Nature Index.
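For readers unfamiliar with heteroscedasticity-based identification, the construction in Lewbel (2012) can be summarised as follows. This is a sketch in generic notation for a single endogenous regressor; the symbols \(Y_1\), \(Y_2\), \(X\), and \(Z\) are introduced here for illustration and are not the variable names used elsewhere in this appendix.

$$\begin{aligned} Y_1&= X^{'}\beta + Y_2\gamma + \varepsilon _1, \qquad Y_2 = X^{'}\alpha + \varepsilon _2,\\ \text {instruments:}\quad&(Z - \bar{Z})\,\hat{\varepsilon }_2, \qquad \text {assuming } \mathrm {Cov}(Z, \varepsilon _1\varepsilon _2) = 0 \text { and } \mathrm {Cov}(Z, \varepsilon _2^2) \ne 0 \end{aligned}$$

Here \(Z\) is a subset of (or equal to) the exogenous regressors \(X\), and \(\hat{\varepsilon }_2\) is the residual from the first-stage regression of \(Y_2\) on \(X\). The second condition requires heteroscedasticity in the first-stage errors, which is what allows instruments to be generated without external variables.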

Results of robustness analysis

In this section, we present our final results and test their robustness by using different sets of models and methods. The following equation gives the specification we are trying to estimate:

$$\begin{aligned} C_i = \beta _i \times SciHub_i + X_i^{'} \gamma _i + \sum _{j=1}^{4}\delta _{ij}\times discipline_{ij} + \sum _{k=1}^{12}\varphi _{ik}\times journal_{ik} + \theta _i \end{aligned}$$
(2)

where \(C_i\) stands for the number of citations the paper i has received; \(\beta \) is our parameter of interest, as it quantifies the relationship between the citations of a paper and the number of times the paper i was downloaded through Sci-Hub; \(X'\) is a vector containing the following control variables: the impact factor of the journal where the paper was published; the length of the title of the paper, as captured by the number of types or unique words in it; the number of graphs included in the paper i for communicating scientific findings; the number of tables included in the paper i; the chaperone effect captured by the H-index of the first and last author of paper i; and the number of authors of the paper i. \(\theta _i\) represents the residuals of our model.

A reasonable assumption in our model is that each discipline and journal has a different citation pattern. Given the variability intrinsically associated with the scientific discipline and the particular journal where the paper was published, we also include dummies for discipline and journal to control for hidden confounds. The above specification could be understood as an extended specification of Eq. 2 in the main manuscript. Table 7 shows the estimates obtained by robust regression.

Table 7 Effects of Sci-Hub on citations based on Robust Regression

We run Eq. 2 again, this time introducing blocks of variables gradually to conduct a sensitivity analysis. Model 1 does not include any control variables. Here, the number of times the paper was downloaded from Sci-Hub has a positive and significant effect on the number of citations. Model 2 introduces the dummies for discipline (i.e., multidisciplinary, economics, consumer research, or neuroscience) and journal. In this case, the results for Sci-Hub remain almost unchanged. In model 3, we add a series of variables related to the characteristics of the document (i.e., the number of figures, tables, and pages, and the length of the title). Once again, the Sci-Hub effect remains robust to this new specification. The number of figures and the number of tables both show a positive and significant relationship with the number of citations. Conversely, the number of pages and the length of the title show the opposite association. In model 4, we introduce a block of variables related to the characteristics of the authors (i.e., the H-index of both the first and last author, the number of authors of the paper, and the chaperone degree). The introduction of these variables does not change the results for the Sci-Hub effect on article citations. All of these variables show positive and significant effects on citations except the chaperone degree (at a statistical significance level of 5%). In model 5, we introduce variables related to the context in which the authors and journals operate (the GDP per capita of the authors’ country, the impact factor of the journal, and the Nature Index). In this model, the Sci-Hub coefficient is still positive and highly significant. Among the remaining variables, only the impact factor matches expectations in terms of sign and statistical significance. Finally, we include all the control variables in the same model. The results remain unchanged from the previous specifications.

In Table 8, we run the same models, but now we estimate the parameters with a heteroscedasticity correction by using robust standard errors.

Table 8 Effects of Sci-Hub on citations based on Robust Standard Errors

Regardless of the specification we use, the results show that the effect of Sci-Hub on the number of citations remains positive and significant. Concerning the characteristics of the document or the authors, the results vary for some variables. For example, the number of tables, the length of the title, and the chaperone effect are not very robust to different specifications. Finally, in Table 9, we estimate our models by tackling the endogeneity problem. In general terms, the models show good performance. We were able to verify the validity of the instruments, except for models 1 and 3, when they were evaluated through the Hansen J statistic. As can be seen, the effect of Sci-Hub on article citations remains robust to different specifications, while the other variables behave similarly to Table 8, except for the variables related to the context in which the authors and journals operate (i.e., impact factor, author i’s GDP per capita, author i’s Nature Index).

Table 9 Effects of Sci-Hub on citations using Heteroskedasticity errors

Part 2

We begin the analysis by focusing on the marginal distribution of the data. Figure 6a depicts the complete distribution. The presence of an outlier article with more than 9000 citations is evident in Fig. 6b. After excluding this article, it is clear that there are still some articles with more than 2000 citations (see Fig. 6c). By removing those articles, it is possible to obtain a smoother distribution (see Fig. 6e and f).

Fig. 6 Density (a) and empirical cumulative distribution function of the original data set (b), and after excluding citation values higher than 9000 (c) and higher than 2000 (d). The last row displays adjusted boxplots of the original data set (e) and after excluding articles with citations >2000 (f)

Fitting the shape of the marginal distribution

Generalized Additive Models for Location, Scale and Shape (GAMLSS) offer a flexible approach that is well suited to modeling these data (Stasinopoulos et al. 2017). GAMLSS allows fitting several count distributions to the marginal distribution and comparing their goodness of fit via the Generalized Akaike Information Criterion (GAIC). Table 10 shows the GAIC results of the different tested distributions. The results indicate that the Zero Inflated Beta Negative Binomial (ZIBNB) and the Zero Adjusted (Hurdle) Beta Negative Binomial (ZABNB) distributions gave the best fit. Figure 7 shows the empirical cumulative distribution function (ECDF) plots of the data and four adjusted distributions.
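The comparison of candidate distributions by GAIC can be sketched in a few lines. The GAIC is defined as minus twice the maximized log-likelihood plus a penalty k for each fitted parameter, with k = 2 giving the usual AIC. The example below is an illustration only: it is in Python rather than R, it uses two simple one-parameter count distributions (Poisson and geometric) as stand-ins for the richer families compared in Table 10, and the counts and function names are hypothetical.

```python
import math

def gaic(log_lik, df, k=2.0):
    """Generalized AIC: -2 * log-likelihood + k * (number of fitted parameters)."""
    return -2.0 * log_lik + k * df

def poisson_loglik(counts):
    """Maximized Poisson log-likelihood (the MLE of the rate is the sample mean)."""
    lam = sum(counts) / len(counts)
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)

def geometric_loglik(counts):
    """Maximized geometric log-likelihood on {0, 1, 2, ...}."""
    mean = sum(counts) / len(counts)
    p = 1.0 / (1.0 + mean)  # MLE of the success probability
    return sum(c * math.log(1.0 - p) + math.log(p) for c in counts)

# Hypothetical, heavily right-skewed citation counts.
counts = [0, 0, 1, 1, 2, 3, 3, 5, 8, 13, 21, 40, 75, 120]
fits = {
    "Poisson": gaic(poisson_loglik(counts), df=1),
    "Geometric": gaic(geometric_loglik(counts), df=1),
}
# Sort in ascending order of GAIC, as in Table 10: the best fit comes first.
for name, value in sorted(fits.items(), key=lambda item: item[1]):
    print(f"{name}: GAIC = {value:.1f}")
```

With skewed counts like these, the heavier-tailed distribution obtains the lower GAIC, which is the same logic that favors the beta negative binomial families for citation data.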

Table 10 Results of the GAMLSS fit procedure. Distributions are sorted in ascending order according to their GAIC values

Fig. 7 The data’s ECDF plot and the ECDFs of four fitted distributions

Statistical analyses

As shown above, the four-parameter Zero Inflated Beta Negative Binomial (ZIBNB) distribution gave the best fit. Hence, the data were modelled with this distribution. The first two parameters of the ZIBNB are \(\mu \) and \(\sigma \), which represent the distribution’s location and scale. Recall that GAMLSS makes it possible to model location (e.g. mean), scale (e.g. SD), and shape (i.e. skewness and kurtosis). For simplicity, though, only the location is modelled. For numeric covariates, besides linear modeling, GAMLSS allows modeling covariates via smoothers (e.g. penalized B-splines, monotone P-splines, loess curves). Also, via the package gamlss.util it is possible to use neural networks, decision trees, and others (see pages 24 to 25 in Stasinopoulos et al. 2017). The results of the location modeling are shown in Table 11 (these results are ranked according to their absolute t-values).

Table 11 Ranking of the variables based on their absolute values. Values for intercepts are not shown

Final model

A forward and backward stepwise variable selection procedure, applied to a model with all variables, suggested the following model:

$$\begin{aligned} C_i&= \alpha + \beta _1 \times SciHub_i + \beta _2 \times APA_i + \beta _3 \times TL_i + \beta _4 \times HIN_i + \beta _5 \times HI1_i +\beta _6 \times CE_i\\&+ \beta _7 \times TG_i + \beta _8 \times TT_i + \beta _9 \times IF_i + \beta _{10} \times GDPpc_i + \beta _{11} \times NI_i + u_i \end{aligned}$$

where \(C_i\) is the number of citations the paper i has received; SciHub is the number of times the paper i was downloaded through Sci-Hub; APA is the number of authors per article; TL is the length of the title of the paper; HIN and HI1 are the H-index of the first and last author of each paper; CE is the chaperone effect; TG and TT are the numbers of graphs and tables included in the paper; IF is the impact factor of the journal where the paper was published; GDPpc is the GDP per capita of the first author’s country; and NI is the Nature Index.

By removing the variables not included in this model, the AIC went from 82692.05 to 82685.60.
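As a worked check of the arithmetic (a lower AIC indicates the preferred model), the improvement obtained by the stepwise selection amounts to:

$$\begin{aligned} \Delta \mathrm {AIC} = 82692.05 - 82685.60 = 6.45 \end{aligned}$$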

Referring back to Table 11, and in order to render the model more parsimonious, it could be argued that if only three variables were to be kept, they would be the dependent variable, Citations, and, in this order, the number of Sci-Hub downloads (ScihubN) and the total number of figures in the published paper (Total.of.figures.y). The impact factor of the publishing journal (IF) could also be considered as another good predictor of citations.

Figure 8 displays the associations between Scihub and Citations (a) and between Total number of figures and Citations (b). Figure 8a and b show linear and non-linear fitting lines. The linear fitting is performed via least median of squares (LMS) regression and the non-linear fitting is done via locally weighted scatterplot smoothing (LOWESS). The three-way associations are graphed using ordinary least squares planes (Fig. 8d) and locally estimated scatterplot smoothing (LOESS).

Fig. 8 Bivariate (a and b) and trivariate associations (c and d) among variables of interest

Cite this article

Correa, J.C., Laverde-Rojas, H., Tejada, J. et al. The Sci-Hub effect on papers’ citations. Scientometrics 127, 99–126 (2022). https://doi.org/10.1007/s11192-020-03806-w

  • DOI: https://doi.org/10.1007/s11192-020-03806-w
