On the relationship between download and citation counts: An introduction of Granger-causality inference
Introduction
Of the various measurements for characterizing scientific impact, citations stand out for their frequent use across many disciplines. The reliance on citations and their variants, from the h-index (Hirsch, 2005) to the g-index (Egghe, 2006), from journal impact factors (Amin & Mabe, 2000; Garfield, 1972, 2006) to Eigenfactors (Bergstrom, 2007), rests on the belief, though debatable, that citations are a quantitative proxy for the importance of a scientific discovery (Bornmann & Daniel, 2008; Bu, Waltman, & Huang, 2019; Tahamtan & Bornmann, 2019; Wang, Song, & Barabási, 2013). As such, these measurements have been widely used for several decades to assess national science policies and disciplinary development (Tijssen, Raan, & Leeuwen, 2002), departments and research laboratories (Narin, 1976), books and journals (Garfield, 1972), and individual scientists (Hirsch, 2005; Sinatra, Wang, Deville, Song, & Barabási, 2016).
Nevertheless, citation-based measurements, especially the raw number of citations, have long been criticized for a broad range of drawbacks (Radicchi, Fortunato, & Castellano, 2008; Waltman & van Eck, 2015; Waltman & Yan, 2014). For instance, citations cannot be observed immediately after publication; rather, one has to wait for citing works to appear in the published literature and to be included in bibliographic databases (Guerrero-Bote & Moya-Anegón, 2014). In some cases, this process takes several years (Powell, 2016), which seriously exacerbates the delay problem. This has motivated scientists to look for an early indicator of scientific impact. Some have proposed alternatives for quantifying scientific impact, such as the number of downloads (Appell, 2007; Guerrero-Bote & Moya-Anegón, 2014; Jamali & Nikzad, 2011; Lippi & Favaloro, 2013; Moed, 2005; Nieder, Dalhaug, & Aandahl, 2013; Schloegl & Gorraiz, 2010; Watson, 2009; Xin, Alberto, & Johan, 2012), which is not subject to citation delays (Kurtz & Bollen, 2011). For example, ScienceDirect, a website offering access to a large body of scientific and medical research, selects the top 25 “hottest” articles according to download-based measurements (Gorraiz, Gumpenberger, & Schlögl, 2014).
In the meantime, the relationship between citations and downloads has been explored in various disciplines (Appell, 2007; Guerrero-Bote & Moya-Anegón, 2014; Jamali & Nikzad, 2011; Kurtz & Henneken, 2017; Lippi & Favaloro, 2013; McGillivray & Astell, 2019; Moed & Halevi, 2016; Moed, 2005; Nieder et al., 2013; Schloegl & Gorraiz, 2010; Watson, 2009; Xin et al., 2012). Yet most inquiries relied purely on the correlation coefficient, which leaves the nature of the relationship insufficiently discussed. For instance, Schloegl and Gorraiz (2010) examined 2625 articles published in Gynecologic Oncology and found a correlation coefficient of 0.410 between the numbers of downloads and citations. However, many details behind this coefficient are missing. For example, little is known about whether the relationship runs from downloads to citations or in the opposite direction. The directionality issue could have profound implications: for instance, if downloads were found to lead citations (“downloads→citations”, whether positively or negatively), funding agencies and research assessors might consider downloads an early signal for predicting the scientific impact of publications, scientists, and/or academic journals.
Besides directionality, the issue of time lags between downloads and citations has not been well addressed. Moed (2005), for example, observed that within three months after an article was cited, its download rate increased by 25 % compared with what it would have been without the citation. He also found that the correlation coefficient between downloads and citations ranges from 0.11 to 0.35, with the lowest value between early downloads and citations and the highest between citation and download rates calculated by ignoring the early period. Although this work considered the time lag between downloads and citations, its strategy for determining the lag length seems quite subjective: a single, constant lag was set for all publications in the empirical study. Yet, as an example, Fig. 1 shows two articles from the Lancet in which the time lag between downloads and citations is clearly smaller in the first case than in the second. Thus, we argue that the lag may vary across publications and that a single, fixed value should not be used for all cases.
To examine the directionality between downloads and citations, this paper introduces a strategy for studying the relation between two time series that is widely adopted outside information science, namely Granger-causality inference (Granger, 1969). When the historical information on the downloads (or citations) of a paper is useful for predicting its subsequent citations (or downloads), the downloads (or citations) are considered the Granger cause of the other. In other words, when a paper’s downloads are observed to affect the probability of its being cited subsequently, and the two indicators occur in chronological order, with the download preceding the citation, the download can be considered the Granger cause of the citation, and vice versa (Granger, 1969). Although Granger causality in the statistical sense is not the ultimate basis for affirming or denying real causation, considering the historical information about the indicators and the order in which downloads and citations occur provides more nuance than correlation analysis does. Granger-causality inference has been used in various disciplines, such as economics (Aslan, 2014; Calderón & Liu, 2003; Dritsakis, 2004; Farmer, 2015; Khan, Bajuri, Yoke, & Khan, 2014; Kholdy & Sohrabian, 2005; Sehrawat & Giri, 2018), politics (Shahbaz, Shabbir, Malik, & Wolters, 2013; Wood, 1992), management science (Weersink & Tauer, 1991), medical science (Bose, Hravnak, & Sereika, 2017; Roebroeck, Formisano, & Goebel, 2005), environmental science (Reichel, Thejll, & Lassen, 2001; Triacca, 2005), computer science (Friston, Moran, & Seth, 2013; Kamiński, Ding, Truccolo, & Bressler, 2001), and biochemistry (Barnett & Seth, 2014; Nedungadi, Rangarajan, Jain, & Ding, 2009).
For instance, in management and economics, Ashley, Granger, and Schmalensee (1980) applied Granger-causality inference to the directionality between two time series, aggregate consumption and aggregate advertising in the U.S. Using a bivariate system of 80 quarterly observations from 1956 to 1975, they found that fluctuations in aggregate consumption can be used to predict fluctuations in aggregate advertising, but found no statistically significant evidence that changes in advertising affect consumption. Weersink and Tauer (1991) examined the Granger-causal relationship between dairy farm size and productivity in Canadian states, and concluded that changes in productivity and average herd size appear to be driven by price changes. In political science, Freeman (1983) discussed the usefulness of Granger-causality inference for studying political relationships, using the arms expenditures of two dyads, the U.S. and the Union of Soviet Socialist Republics and India and Pakistan, and concluded that there is a Granger-causal relationship between the arms expenditures within both dyads.
Despite the widespread application of Granger-causality inference, only a limited number of studies in information science have adopted this strategy. For example, Lee, Lin, Chuang, and Lee (2011) attempted to infer the causal relationships between scientific output and economic development by using the Granger-causality inference method. In particular, they observed a significant relation between the number of scientific publications and Gross Domestic Product (GDP) for Asian countries but not for Western countries. The current paper explores the directionality between downloads and citations based on 2007–2017 publications in the Lancet. To accurately determine the time lag between downloads and citations, we implement the Granger-causality inference case by case. A further objective of this paper is to introduce Granger-causality inference to the field of information science in a step-by-step manner so that future information scientists can apply the method to their own research contexts.
Section snippets
Data
To keep the time series of downloads and citations comparable, we acquire both from the Mendeley database. Mendeley is a popular reference management tool and social network that allows scientists to find, read, and import articles from Mendeley databases or other providers. In Mendeley, when a scientist (user) “adds” an article to his/her library, the article is counted as “read”. Meanwhile, Mendeley calculates the number of readers of each
Principle
Most strategies for analyzing time series rest on the strong assumption that the time series under investigation is stationary. Statistically, a stationary time series (in the weak sense) has statistical properties (e.g., mean, variance, and autocorrelation) that are constant over time. Intuitively, the examples shown in Fig. 3 illustrate a stationary (left) and a non-stationary (right) time series, where one can observe that a stationary time series does not have any obvious upward or downward trend or
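As an illustration of how stationarity can be checked in practice, the following minimal sketch runs a basic Dickey-Fuller style regression on synthetic data. It is not the authors' procedure: the choice of test, the function names, the synthetic series, and the approximate 5 % critical value of -2.86 (constant-only case, large samples) are all assumptions made here for illustration. A real analysis would more likely rely on an established implementation, for example `adfuller` in statsmodels.

```python
import math
import random


def difference(series):
    """Return the first-order difference y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def dickey_fuller_t(series):
    """t-statistic of gamma in the regression dy[t] = alpha + gamma * y[t-1] + e[t]."""
    lagged = series[:-1]
    dy = difference(series)
    n = len(dy)
    mean_x = sum(lagged) / n
    mean_y = sum(dy) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, dy))
    gamma = sxy / sxx
    alpha = mean_y - gamma * mean_x
    rss = sum((y - alpha - gamma * x) ** 2 for x, y in zip(lagged, dy))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err


# Assumed approximate 5 % critical value (regression with a constant, no trend).
CRITICAL_5PCT = -2.86


def looks_stationary(series):
    """True when the unit-root null hypothesis is rejected at roughly 5 %."""
    return dickey_fuller_t(series) < CRITICAL_5PCT


# Synthetic data: white noise, and a random walk built from the same shocks.
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(200)]
walk, level = [], 0.0
for shock in noise:
    level += shock
    walk.append(level)

print("white noise t-statistic:", dickey_fuller_t(noise))
print("random walk t-statistic:", dickey_fuller_t(walk))
print("differenced walk stationary:", looks_stationary(difference(walk)))
```

The differenced random walk recovers the underlying shocks, which is why first-order differencing is a common remedy when the original series fails the test.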
Principles
The vector auto-regression (VAR) model (Ashley et al., 1980) is a frequently used strategy in Granger-causality inference to determine the length of the time lag between two time series, annotated as . Sims (1980) later introduced this model to economics for studying dynamic and temporal relationships between economic indicators. In a VAR model, suppose that we have two time series, and . Mathematically, we fit the regression model:
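For orientation, a bivariate VAR is commonly written in the textbook form below. The notation is an assumption made here because the original symbols are not reproduced in this excerpt: X_t and Y_t stand for the two time series, p for the lag length, the Greek letters for regression coefficients, and ε_t and η_t for error terms.

```latex
\begin{aligned}
Y_t &= \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + \sum_{i=1}^{p} \beta_i X_{t-i} + \varepsilon_t, \\
X_t &= \gamma_0 + \sum_{i=1}^{p} \gamma_i X_{t-i} + \sum_{i=1}^{p} \delta_i Y_{t-i} + \eta_t .
\end{aligned}
```

In this form, the statement that X does not Granger cause Y corresponds to β_1 = ... = β_p = 0, and the statement that Y does not Granger cause X corresponds to δ_1 = ... = δ_p = 0.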
Principle
Recall that in Step 1, we divide all publications into three types: Type I (both the download and citation time series are stationary), Type II (only one of the download and citation time series is stationary), and Type III (neither the download nor the citation time series is stationary). For Type II and Type III publications, if their first-order differenced time series are found to be stationary, we can retain them for Step 2. Yet, it is quite likely that the two original time series are non-stationary but have a long-term
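The classification above, together with one common way of checking whether two non-stationary series move together in the long run, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the two-step residual-based check in the style of Engle and Granger, the function names, the synthetic data, and the approximate 5 % critical value of -3.34 (two variables, constant in the first-step regression) are assumptions introduced here.

```python
import math
import random


def classify(download_stationary, citation_stationary):
    """Map the two stationarity verdicts to the publication types in the text."""
    count = int(download_stationary) + int(citation_stationary)
    return {2: "Type I", 1: "Type II", 0: "Type III"}[count]


def ols_line(x, y):
    """Intercept and slope of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_unit_root_t(resid):
    """t-statistic of g in the regression du[t] = g * u[t-1] + e[t] (no constant)."""
    lagged = resid[:-1]
    du = [b - a for a, b in zip(resid, resid[1:])]
    suu = sum(u * u for u in lagged)
    g = sum(u * d for u, d in zip(lagged, du)) / suu
    rss = sum((d - g * u) ** 2 for u, d in zip(lagged, du))
    return g / math.sqrt(rss / (len(du) - 1) / suu)


# Assumed approximate 5 % critical value for the residual-based test.
EG_CRITICAL_5PCT = -3.34


def cointegrated(x, y):
    """Step 1: regress y on x. Step 2: test the residuals for a unit root."""
    a, b = ols_line(x, y)
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    return residual_unit_root_t(resid) < EG_CRITICAL_5PCT


# Synthetic example: citations track a random-walk download series plus noise.
rng = random.Random(1)
downloads, level = [], 0.0
for _ in range(200):
    level += rng.gauss(0, 1)
    downloads.append(level)
citations = [2.0 * d + rng.gauss(0, 0.5) for d in downloads]

print(classify(False, False), "cointegrated:", cointegrated(downloads, citations))
```

Both synthetic series wander without a fixed mean, yet their linear combination stays close to zero, which is the kind of long-term relationship such a test is designed to detect.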
Principle
Having implemented the aforementioned three steps, we proceed to the Granger-causal inference. In this process, the null hypothesis is that the time series X does not “Granger cause” the other time series Y, or that Y does not “Granger cause” X. We identify the results by quantifying the p-values. For example, when the 95 % confidence level is set, if the test is significant (i.e., the p-value is less than 0.05), we reject the null hypothesis and conclude that there is a Granger causal
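The test logic described above can be sketched in plain Python as a comparison between a restricted regression (own lags only) and a full regression (own lags plus lags of the candidate cause). This is a minimal sketch under stated assumptions, not the authors' code: the function names and synthetic data are hypothetical, and the p-value uses a large-sample chi-square approximation instead of the exact F distribution. In practice one would more likely call an existing routine such as `grangercausalitytests` in statsmodels.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta


def rss(rows, target):
    """Residual sum of squares of an OLS fit of target on the given rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(bi * ri for bi, ri in zip(beta, r))) ** 2
               for r, t in zip(rows, target))


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total


def granger_test(cause, effect, lag):
    """F statistic and approximate p-value for H0: cause does not Granger-cause effect."""
    times = range(lag, len(effect))
    target = effect[lag:]
    own = [[1.0] + [effect[t - i] for i in range(1, lag + 1)] for t in times]
    full = [row + [cause[t - i] for i in range(1, lag + 1)]
            for row, t in zip(own, times)]
    rss_restricted = rss(own, target)
    rss_full = rss(full, target)
    dof = len(target) - (2 * lag + 1)
    f_stat = ((rss_restricted - rss_full) / lag) / (rss_full / dof)
    # Large-sample approximation: lag * F is treated as chi-square with `lag` df.
    return f_stat, chi2_sf(lag * f_stat, lag)


# Synthetic example in which downloads lead citations by one period.
rng = random.Random(42)
downloads = [rng.gauss(0, 1) for _ in range(200)]
citations = [0.0] + [0.8 * downloads[t - 1] + rng.gauss(0, 0.3)
                     for t in range(1, 200)]

f_dc, p_dc = granger_test(downloads, citations, lag=1)
f_cd, p_cd = granger_test(citations, downloads, lag=1)
print("downloads -> citations: F =", f_dc, "p =", p_dc)
print("citations -> downloads: F =", f_cd, "p =", p_cd)
```

Because the test is run in both directions, each publication yields two p-values; calling `granger_test` separately for each publication with its own lag mirrors the case-by-case strategy described in the text.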
Conclusions
This paper explores potential relations between the numbers of citations and downloads of scientific publications beyond correlation by using a Granger-causality test. Scientific publications of the Lancet, as well as their citation and download counts, are employed in our empirical study. We find that there is significant directionality between these two variables for many publications. Yet, the detailed patterns vary case by case. This indicates that from a research evaluation perspective,
Author contributions
Beibei Hu: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.
Yang Ding: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.
Xianlei Dong: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis.
Yi Bu: Conceived and designed the analysis, Wrote the paper.
Ying Ding: Conceived
Declaration of Competing Interest
The authors declare no competing interests.
Acknowledgments
The authors would like to thank two anonymous reviewers for their constructive comments. This research is funded by the programs of National Natural Science Foundation of China (Grant No.: 71904110), Humanities and Social Science Foundation of Ministry of Education of the People’s Republic of China (Grant No.: 19YJCGJW014), China Postdoctoral Science Foundation (Grant No.: 2017M610440), National Natural Science Foundation of China (Grant No.: 71701115), and National Nature Science Foundation of
References (72)
et al. Type 1 diabetes. The Lancet (2014)
et al. The MVGC multivariate Granger causality toolbox: A new approach to Granger-causal inference. Journal of Neuroscience Methods (2014)
et al. Strategic science with policy impact. The Lancet (2015)
et al. The direction of causality between financial development and economic growth. Journal of Development Economics (2003)
et al. Bangladesh: Innovating for health. The Lancet (2013)
et al. Analysing connectivity with Granger causality and dynamic causal modelling. Current Opinion in Neurobiology (2013)
Identifying restrictions of linear equations with applications to simultaneous equations and cointegration. Journal of Econometrics (1995)
et al. The impacts of corruption, macroeconomic instability and market competitiveness on bank’s profitability. International Journal of Information Processing and Management (2014)
et al. Article downloads and citations: Is there any relationship? Clinica Chimica Acta; International Journal of Clinical Chemistry (2013)
et al. Global malaria mortality between 1980 and 2010: A systematic analysis. The Lancet (2012)
Mapping directed influence over the brain using Granger causality and fMRI. Neuroimage
An analysis of a causal relationship between economic growth and terrorism in Pakistan. Economic Modelling
Fertility preservation for age-related fertility decline. The Lancet
Field-normalized citation impact indicators and the choice of an appropriate counting method. Journal of Informetrics
Statistical predictor identification. Annals of the Institute of Statistical Mathematics
A new look at the statistical model identification. IEEE Transactions on Automatic Control
Impact factors: Use and abuse. International Journal of Environment Science and Technology
Is the future of scientific journals electronic? Some considerations about downloads and citations. International Journal of Sports Medicine
Advertising and aggregate consumption: An analysis of causality. Econometrica
Tourism development and economic growth in the Mediterranean countries: Evidence from panel Granger causality tests. Current Issues in Tourism
Eigenfactor: Measuring the value and prestige of scholarly journals. College & Research Libraries News
What do citation counts measure? A review of studies on citing behavior. Journal of Documentation
Vector autoregressive models and granger causality in time series analysis in nursing research: Dynamic changes among vital signs prior to cardiorespiratory instability events as an example. Nursing Research
A multidimensional perspective on the citation impact of scientific publications
Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association
Tourism as a long-run economic growth factor: An empirical investigation for Greece using causality analysis. Tourism Economics
Theory and practise of the g-index. Scientometrics
Co-integration and error correction: Representation, estimation, and testing. Econometrica
The stock market crash really did cause the great recession. Oxford Bulletin of Economics and Statistics
Granger causality and the times series analysis of political relationships. American Journal of Political Science
Citation analysis as a tool in journal evaluation. Science
The history and meaning of the journal impact factor. JAMA
Usage versus citation behaviours in four subject areas. Scientometrics
Investigating causal relations by econometric models and cross-spectral methods. Econometrica
Spurious regressions in econometrics
Relationship between downloads and citations at journal and paper levels, and the influence of language. Scientometrics