On the relationship between download and citation counts: An introduction of Granger-causality inference

doi:10.1016/j.joi.2020.101125

Journal of Informetrics

Volume 15, Issue 2, May 2021, 101125

https://doi.org/10.1016/j.joi.2020.101125

Highlights

• Studies the relationship between download and citation counts of scientific publications.
• Introduces the Granger-causality inference method into Information Science for studying time series directionality.
• Considers the length of time series of publications case by case.

Abstract

Studying the relationship between the numbers of citations and downloads of scientific publications is beneficial for understanding the mechanism of citation patterns and for research evaluation. However, few studies have considered the directionality between downloads and citations or adopted a case-by-case time lag length between the download and citation time series of each individual publication. In this paper, we introduce the Granger-causal inference strategy to study the directionality between downloads and citations and set the length of the time lag between the time series for each case. By examining publications in the Lancet, we find that publications have various directionality patterns, but highly cited publications are more likely to exhibit Granger causality. We introduce the Granger-causal inference method to information science step by step, in four steps: conducting stationarity tests, determining the time lag between time series, conducting a cointegration test, and implementing Granger-causality inference. We hope that this method can be applied by future information scientists in their own research contexts.

Introduction

Of the various measurements for characterizing scientific impact, citations stand out for their frequent use in many disciplines. The reliance on citations and their variants, from the h-index (Hirsch, 2005) to the g-index (Egghe, 2006), from journal impact factors (Amin & Mabe, 2000; Garfield, 1972, 2006) to Eigenfactors (Bergstrom, 2007), rests on the belief, though debatable, that citations are a quantitative proxy of a scientific discovery’s importance (Bornmann & Daniel, 2008; Bu, Waltman, & Huang, 2019; Tahamtan & Bornmann, 2019; Wang, Song, & Barabási, 2013). As such, these measurements have been widely used for several decades to assess national science policies and disciplinary development (Tijssen, Raan, & Leeuwen, 2002), departments and research laboratories (Narin, 1976), books and journals (Garfield, 1972), and individual scientists (Hirsch, 2005; Sinatra, Wang, Deville, Song, & Barabási, 2016).

Nevertheless, citation-based measurements, especially the raw number of citations, have long been criticized for a broad range of drawbacks (Radicchi, Fortunato, & Castellano, 2008; Waltman & van Eck, 2015; Waltman & Yan, 2014). For instance, citations cannot be observed immediately after publication; rather, one has to wait for them to emerge in the published literature and be included in bibliographic databases (Guerrero-Bote & Moya-Anegón, 2014). In some cases, this process may take several years (Powell, 2016), which seriously exacerbates the delay problem. This has motivated scientists to seek an early indicator of scientific impact. Hence, some have proposed alternatives for quantifying scientific impact, such as the number of downloads (Appell, 2007; Guerrero-Bote & Moya-Anegón, 2014; Jamali & Nikzad, 2011; Lippi & Favaloro, 2013; Moed, 2005; Nieder, Dalhaug, & Aandahl, 2013; Schloegl & Gorraiz, 2010; Watson, 2009; Xin, Alberto, & Johan, 2012), which is not subject to citation delays (Kurtz & Bollen, 2011). For example, ScienceDirect, a website offering access to a large body of scientific and medical research, selects the top 25 “hottest” articles according to download-based measurements (Gorraiz, Gumpenberger, & Schlögl, 2014).

In the meantime, the relationship between citations and downloads has been explored in various disciplines (Appell, 2007; Guerrero-Bote & Moya-Anegón, 2014; Jamali & Nikzad, 2011; Kurtz & Henneken, 2017; Lippi & Favaloro, 2013; McGillivray & Astell, 2019; Moed & Halevi, 2016; Moed, 2005; Nieder et al., 2013; Schloegl & Gorraiz, 2010; Watson, 2009; Xin et al., 2012). Yet most inquiries relied purely on the correlation coefficient, which leaves the nature and extent of the relationship insufficiently discussed. For instance, Schloegl and Gorraiz (2010) examined 2625 articles published in Gynecologic Oncology and found that the correlation coefficient between the numbers of downloads and citations equals 0.410. However, many details behind this coefficient are missing. For example, little is known about whether the relationship runs from downloads to citations or in the opposite direction. The directionality issue could have profound implications. For instance, if downloads were found to lead citations (“downloads→citations”, regardless of whether the effect is positive or negative), funding agencies and research assessors might consider downloads as an early signal for predicting the scientific impact of publications, scientists, and/or academic journals.

Besides directionality, the issue of time lags between downloads and citations has not been well addressed. Moed (2005), for example, observed that within three months after an article was cited, its download rate increased by 25 % compared with the rate had the article not been cited. He also found that the correlation coefficient between downloads and citations ranges from 0.11 to 0.35, with the lowest between early downloads and citations, and the highest between citation and download rates calculated with the early period ignored. Although the time lag between downloads and citations was considered in this work, the strategy for determining the lag length seems quite subjective: a specific, constant lag length was set for all publications in the empirical study. Yet, as an example, Fig. 1 shows two articles from the Lancet where one can observe that the time lag between downloads and citations in the first case is obviously smaller than in the second. Thus, we argue that the lag may vary across publications and that a single, fixed value should not be used to analyze all cases.

To examine the directionality between downloads and citations, this paper introduces a strategy widely adopted outside information science for studying the relation between two time series, namely Granger-causality inference (Granger, 1969). When the historical information of the download (or citation) of a paper is useful to predict the subsequent citation (or download), the download (or citation) is considered the Granger cause of the other. In other words, when a paper’s downloads are observed to have an impact on the probability of its being cited subsequently, and the occurrence of these two indicators is in chronological order, such as the download occurring before the citation, it can be considered that the download is the Granger cause of the citation, and vice versa (Granger, 1969). Although Granger causality in the statistical sense is not the ultimate basis for affirming or denying real causation, considering the historical information about indicators and the order in which downloads and citations occur provides more nuance than correlation analysis does. The Granger-causality inference has been used in various disciplines, such as economics (Aslan, 2014; Calderón & Liu, 2003; Dritsakis, 2004; Farmer, 2015; Khan, Bajuri, Yoke, & Khan, 2014; Kholdy & Sohrabian, 2005; Sehrawat & Giri, 2018), politics (Shahbaz, Shabbir, Malik, & Wolters, 2013; Wood, 1992), management science (Weersink & Tauer, 1991), medical science (Bose, Hravnak, & Sereika, 2017; Roebroeck, Formisano, & Goebel, 2005), environmental science (Reichel, Thejll, & Lassen, 2001; Triacca, 2005), computer science (Friston, Moran, & Seth, 2013; Kamiński, Ding, Truccolo, & Bressler, 2001), and biochemistry (Barnett & Seth, 2014; Nedungadi, Rangarajan, Jain, & Ding, 2009).
For instance, in management and economics, Ashley, Granger, and Schmalensee (1980) applied the Granger-causality inference to understand the directionality between two time series, namely aggregate consumption and aggregate advertising in the U.S. They constructed a bivariate system of 80 quarterly observations of aggregate advertising and aggregate consumption from 1956 to 1975 and found that fluctuations in aggregate consumption cause fluctuations in aggregate advertising, that is, the former could be used to predict the latter but not vice versa; no statistically significant evidence that advertising changes affect consumption was found. Weersink and Tauer (1991) examined the Granger causal relationship between dairy farm size and productivity in Canadian states, and concluded that the changes in productivity and average herd size appear to be driven by price changes. In political science, Freeman (1983) discussed the usefulness of applying Granger-causality inference to the study of political relationships, such as those between the arms expenditures of the U.S. and the Union of Soviet Socialist Republics, or of India and Pakistan, and concluded that there is a Granger-causal relationship between the arms expenditures of both dyads.

Despite the widespread application of the Granger-causality inference, only a limited number of studies in information science have adopted this strategy. For example, Lee, Lin, Chuang, and Lee (2011) attempted to infer the causal relationships between scientific outputs and economic development by using the Granger-causality inference method. In particular, they observed a significant relation between the number of scientific publications and Gross Domestic Product (GDP) for Asian countries but not for western countries. The current paper explores the directionality of downloads and citations based upon 2007–2017 publications in the Lancet. To accurately determine the time lag between downloads and citations, we implement the Granger-causality inference case by case. Another research objective of this paper is to introduce the strategy of Granger-causal inference to the field of information science in a step-by-step manner so that future information scientists can apply this method to their own research contexts.

Section snippets

Data

To ensure that the time series of downloads and citations are comparable, we choose the Mendeley database to acquire downloads and citations. Mendeley is a popular reference management tool and social network that allows scientists to find, read, and import articles from Mendeley databases or other providers. In Mendeley, when a scientist (user) “adds” an article to his/her library, the article is regarded as “read”. Meanwhile, Mendeley calculates the number of readers of each

Principle

Most strategies for analyzing time series rest on the strong assumption that the time series under investigation is stationary. Statistically, a stationary time series has constant statistical properties (e.g., mean, variance, and auto-correlation) over time (in the weak sense). Intuitively, the examples shown in Fig. 3 illustrate a stationary (left) and a non-stationary (right) time series where one can observe that a stationary time series does not have any obvious upward or downward trend or
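To make the notion of a stationarity test concrete, the following is a minimal sketch of a basic (non-augmented) Dickey-Fuller regression with a constant, written in plain Python. It is an illustration under stated assumptions rather than the procedure used in the paper: the simulated series, the function names, and the approximate 5 % critical value of -2.86 (the asymptotic value for the constant-only case) are all introduced here for illustration only.

```python
import random


def difference(series):
    """First-order difference: d_t = y_t - y_{t-1}."""
    return [b - a for a, b in zip(series[:-1], series[1:])]


def dickey_fuller_t(series):
    """t-statistic of gamma in the regression dy_t = alpha + gamma * y_{t-1} + e_t."""
    lagged = series[:-1]
    dy = difference(series)
    n = len(dy)
    mean_lag = sum(lagged) / n
    mean_dy = sum(dy) / n
    sxx = sum((v - mean_lag) ** 2 for v in lagged)
    sxy = sum((v - mean_lag) * (d - mean_dy) for v, d in zip(lagged, dy))
    gamma = sxy / sxx
    alpha = mean_dy - gamma * mean_lag
    rss = sum((d - alpha - gamma * v) ** 2 for v, d in zip(lagged, dy))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    return gamma / std_err


# Hypothetical data: a random walk (non-stationary by construction).
rng = random.Random(0)
walk, level = [], 0.0
for _ in range(200):
    level += rng.gauss(0, 1)
    walk.append(level)

DF_CRITICAL_5PCT = -2.86  # assumed approximate critical value (constant, no trend)

t_walk = dickey_fuller_t(walk)              # statistic for the original series
t_diff = dickey_fuller_t(difference(walk))  # statistic after first differencing

print("random walk t-statistic:", round(t_walk, 2))
print("differenced t-statistic:", round(t_diff, 2))
print("differenced series looks stationary:", t_diff < DF_CRITICAL_5PCT)
```

A statistic below the critical value rejects the unit-root null hypothesis. In this sketch the first difference of the random walk is white noise, so its statistic is strongly negative; in practice an augmented test with lagged difference terms is the more common choice.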

Principles

The vector auto-regression (VAR) model (Ashley et al., 1980) is a frequently used strategy in Granger-causality inference to determine the length of the time lag between two time series, denoted $p$. Sims (1980) later introduced this model to economics for studying dynamic and temporal relationships between economic indicators. In a VAR model, suppose that we have two time series, $X_{t} = [x_{1}, x_{2}, \dots, x_{t}]$ and $Y_{t} = [y_{1}, y_{2}, \dots, y_{t}]$. Mathematically, we fit a regression model: $\begin{cases} Y_{t} = a_{1} + A_{11} Y_{t - 1} + A_{12} Y_{t - 2} + \dots + A_{1 p} Y \end{cases}$
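One common way to choose the lag length $p$ of a bivariate VAR is to fit the model for several candidate lags and keep the one that minimizes an information criterion. The sketch below assumes the Akaike information criterion in its multivariate form, ln|Σ| + 2k/n; the snippet above does not state which criterion the authors use, so this choice, the simulated `downloads` and `citations` series, and all function names are assumptions made for illustration.

```python
import math
import random


def solve_ols(rows, targets):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def var_aic(x, y, p, max_lag):
    """AIC of a bivariate VAR(p); every p is fitted on the same sample for comparability."""
    rows = [[1.0] + [s[t - k] for k in range(1, p + 1) for s in (x, y)]
            for t in range(max_lag, len(x))]
    n = len(rows)
    residuals = []
    for series in (x, y):
        targets = series[max_lag:]
        beta = solve_ols(rows, targets)
        residuals.append([t - sum(b * v for b, v in zip(beta, r))
                          for r, t in zip(rows, targets)])
    ex, ey = residuals
    sxx = sum(e * e for e in ex) / n
    syy = sum(e * e for e in ey) / n
    sxy = sum(u * v for u, v in zip(ex, ey)) / n
    det = sxx * syy - sxy * sxy          # determinant of the residual covariance
    n_params = 2 * (2 * p + 1)           # two equations, each with 2p lags and a constant
    return math.log(det) + 2 * n_params / n


def select_lag(x, y, max_lag):
    """Candidate lag with the smallest AIC."""
    return min(range(1, max_lag + 1), key=lambda p: var_aic(x, y, p, max_lag))


# Hypothetical data: citations respond to downloads two periods earlier.
rng = random.Random(1)
downloads = [rng.gauss(0, 1) for _ in range(300)]
citations = [rng.gauss(0, 1) if t < 2 else 0.9 * downloads[t - 2] + rng.gauss(0, 1)
             for t in range(300)]

print("selected lag:", select_lag(downloads, citations, max_lag=4))
```

Because the criterion is evaluated separately for each pair of series, this procedure naturally yields a lag length per publication rather than one fixed value for all cases. AIC can occasionally select a lag larger than the true one, so the selected value should be read as a data-driven estimate.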

Principle

Recall that in Step 1, we divide all publications into three types: Type I (both the download and citation time series are stationary), Type II (either the download or the citation time series is stationary), and Type III (neither the download nor the citation time series is stationary). For Type II and III publications, if their first-order differenced time series are found to be stationary, we can retain them for Step 2. Yet, it is quite likely that the two original time series are non-stationary but have a long-term
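As an illustration of what a cointegration test checks, the sketch below follows the two-step residual-based idea associated with Engle and Granger (1987): regress one series on the other, then test whether the residuals are stationary. Whether the paper uses this test or another one is not stated in the snippet above, so the method choice, the simulated series, the function names, and the approximate 5 % critical value of -3.34 (for two variables with a constant in the first-step regression) are assumptions.

```python
import random


def ols_line(x, y):
    """Intercept and slope of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_df_t(u):
    """t-statistic of gamma in du_t = gamma * u_{t-1} + e_t (no constant)."""
    lagged = u[:-1]
    diffs = [b - a for a, b in zip(u[:-1], u[1:])]
    suu = sum(v * v for v in lagged)
    gamma = sum(v * d for v, d in zip(lagged, diffs)) / suu
    rss = sum((d - gamma * v) ** 2 for v, d in zip(lagged, diffs))
    return gamma / (rss / (len(diffs) - 1) / suu) ** 0.5


def engle_granger_stat(x, y):
    """Step 1: regress y on x. Step 2: unit-root statistic of the residuals."""
    intercept, slope = ols_line(x, y)
    residuals = [w - intercept - slope * v for v, w in zip(x, y)]
    return residual_df_t(residuals)


def random_walk(rng, n):
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


# Hypothetical data: citations share the stochastic trend of downloads,
# while `unrelated` is an independent random walk.
rng = random.Random(2)
downloads = random_walk(rng, 200)
citations = [0.5 * d + rng.gauss(0, 1) for d in downloads]
unrelated = random_walk(rng, 200)

EG_CRITICAL_5PCT = -3.34  # assumed approximate critical value

stat_coint = engle_granger_stat(downloads, citations)
stat_indep = engle_granger_stat(downloads, unrelated)
print("cointegrated pair statistic:", round(stat_coint, 2))
print("independent pair statistic:", round(stat_indep, 2))
print("evidence of cointegration:", stat_coint < EG_CRITICAL_5PCT)
```

A statistic below the critical value suggests that the two non-stationary series move together in the long run, which is the situation the paragraph above describes.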

Principle

After completing the aforementioned three steps, we proceed to the Granger-causality inference. In this process, the null hypothesis is that the time series X does not “Granger cause” the other time series Y, or that Y does not “Granger cause” X. We assess the results through the p-values. For example, at the 95 % confidence level, if the test is significant (i.e., the p-value is less than 0.05), we reject the null hypothesis and confirm that there is a Granger causal
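The test described above is commonly carried out as an F-test that compares a restricted regression (the effect series on its own lags) with an unrestricted one (own lags plus lags of the candidate cause). The following sketch assumes that formulation with a single lag and compares the statistic with an approximate 5 % critical value instead of computing a p-value; the simulated data, the function names, and the critical value of 3.89 for F(1, 196) are assumptions for illustration, not the authors' implementation.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations, Gauss-Jordan)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(cause, effect, lag=1):
    """F-statistic for the null hypothesis that `cause` does not Granger-cause `effect`."""
    restricted, unrestricted, targets = [], [], []
    for t in range(lag, len(effect)):
        own = [effect[t - k] for k in range(1, lag + 1)]
        other = [cause[t - k] for k in range(1, lag + 1)]
        restricted.append([1.0] + own)
        unrestricted.append([1.0] + own + other)
        targets.append(effect[t])
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Hypothetical data: citations depend on the previous period's downloads.
rng = random.Random(3)
downloads = [rng.gauss(0, 1) for _ in range(200)]
citations = [rng.gauss(0, 1) if t < 1 else 0.8 * downloads[t - 1] + rng.gauss(0, 1)
             for t in range(200)]

F_CRITICAL_5PCT = 3.89  # assumed approximate 5 % critical value of F(1, 196)

f_down_to_cit = granger_f(downloads, citations)
f_cit_to_down = granger_f(citations, downloads)
print("downloads -> citations F:", round(f_down_to_cit, 2))
print("citations -> downloads F:", round(f_cit_to_down, 2))
print("reject 'downloads do not Granger-cause citations':",
      f_down_to_cit > F_CRITICAL_5PCT)
```

Running the test in both directions, as done here, is what distinguishes the four possible outcomes for a publication: downloads→citations, citations→downloads, both, or neither.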

Conclusions

This paper explores potential relations between the numbers of citations and downloads of scientific publications beyond correlation by using a Granger-causality test. Scientific publications of the Lancet, as well as their citation and download counts, are employed in our empirical study. We find that there is significant directionality between these two variables for many publications. Yet, the detailed patterns vary case by case. This indicates that from a research evaluation perspective,

Author contributions

Beibei Hu: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.

Yang Ding: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.

Xianlei Dong: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis.

Yi Bu: Conceived and designed the analysis, Wrote the paper.

Ying Ding: Conceived

Declaration of Competing Interest

The authors declare no competing interests.

Acknowledgments

The authors would like to thank the two anonymous reviewers for their constructive comments. This research is funded by the programs of the National Natural Science Foundation of China (Grant No.: 71904110), the Humanities and Social Science Foundation of the Ministry of Education of the People’s Republic of China (Grant No.: 19YJCGJW014), the China Postdoctoral Science Foundation (Grant No.: 2017M610440), the National Natural Science Foundation of China (Grant No.: 71701115), and National Nature Science Foundation of

References (72)

M.A. Atkinson et al. Type 1 diabetes. The Lancet (2014)

L. Barnett et al. The MVGC multivariate Granger causality toolbox: A new approach to Granger-causal inference. Journal of Neuroscience Methods (2014)

K.D. Brownell et al. Strategic science with policy impact. The Lancet (2015)

C. Calderón et al. The direction of causality between financial development and economic growth. Journal of Development Economics (2003)

P. Das et al. Bangladesh: Innovating for health. The Lancet (2013)

K. Friston et al. Analysing connectivity with Granger causality and dynamic causal modelling. Current Opinion in Neurobiology (2013)

S. Johansen. Identifying restrictions of linear equations with applications to simultaneous equations and cointegration. Journal of Econometrics (1995)

H. Khan et al. The impacts of corruption, macroeconomic instability and market competitiveness on bank’s profitability. International Journal of Information Processing and Management (2014)

G. Lippi et al. Article downloads and citations: Is there any relationship? Clinica Chimica Acta; International Journal of Clinical Chemistry (2013)

C.J. Murray et al. Global malaria mortality between 1980 and 2010: A systematic analysis. The Lancet (2012)

A. Roebroeck et al. Mapping directed influence over the brain using Granger causality and fMRI. Neuroimage (2005)

M. Shahbaz et al. An analysis of a causal relationship between economic growth and terrorism in Pakistan. Economic Modelling (2013)

D. Stoop et al. Fertility preservation for age-related fertility decline. The Lancet (2014)

L. Waltman et al. Field-normalized citation impact indicators and the choice of an appropriate counting method. Journal of Informetrics (2015)

H. Akaike. Statistical predictor identification. Annals of the Institute of Statistical Mathematics (1970)

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control (1974)

M. Amin et al. Impact factors: Use and abuse. International Journal of Environment Science and Technology (2000)

H.J. Appell. Is the future of scientific journals electronic? Some considerations about downloads and citations. International Journal of Sports Medicine (2007)

R. Ashley et al. Advertising and aggregate consumption: An analysis of causality. Econometrica (1980)

A. Aslan. Tourism development and economic growth in the Mediterranean countries: Evidence from panel Granger causality tests. Current Issues in Tourism (2014)

C. Bergstrom. Eigenfactor: Measuring the value and prestige of scholarly journals. College & Research Libraries News (2007)

L. Bornmann et al. What do citation counts measure? A review of studies on citing behavior. Journal of Documentation (2008)

E. Bose et al. Vector autoregressive models and Granger causality in time series analysis in nursing research: Dynamic changes among vital signs prior to cardiorespiratory instability events as an example. Nursing Research (2017)

Y. Bu et al. A multidimensional perspective on the citation impact of scientific publications (2019)

D.A. Dickey et al. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association (1979)

N. Dritsakis. Tourism as a long-run economic growth factor: An empirical investigation for Greece using causality analysis. Tourism Economics (2004)

L. Egghe. Theory and practise of the g-index. Scientometrics (2006)

R.F. Engle et al. Co-integration and error correction: Representation, estimation, and testing. Econometrica (1987)

R.E.A. Farmer. The stock market crash really did cause the great recession. Oxford Bulletin of Economics and Statistics (2015)

J.R. Freeman. Granger causality and the times series analysis of political relationships. American Journal of Political Science (1983)

E. Garfield. Citation analysis as a tool in journal evaluation. Science (1972)

E. Garfield. The history and meaning of the journal impact factor. JAMA (2006)

J. Gorraiz et al. Usage versus citation behaviours in four subject areas. Scientometrics (2014)

C.W.J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica (1969)

C.W. Granger et al. Spurious regressions in econometrics

V.P. Guerrero-Bote et al. Relationship between downloads and citations at journal and paper levels, and the influence of language. Scientometrics (2014)
