Journal of Informetrics

Volume 11, Issue 4, November 2017, Pages 989-1002

Regular article
How is R cited in research outputs? Structure, impacts, and citation standard

https://doi.org/10.1016/j.joi.2017.08.003

Highlights

  • R and its packages have a strong impact in the PLoS database.

  • R and its packages are cited in these papers in a fairly inconsistent way.

  • Software citation standards should account for the complex structure of software.

Abstract

This paper addresses software citation by analyzing how R and its packages are cited in a sample of PLoS papers. A codebook is developed to support a content analysis of the full-text papers. Our results indicate that the software R and its packages are inconsistently cited, as is the case with other scientific software. The inconsistency derives partly from the variety of citation standards currently used for software, and partly from the fact that authors do not follow these standards closely at any of several levels. This work sheds light on the future development of software citation standards, especially given the present landscape of conflicting citation practices. Moreover, our approach furnishes a possible blueprint for dealing with the granularity of software entities in scientific citation: we consider citations of the core R software environment, of specific R packages, and of individual functions.

Introduction

This paper concerns the citation and mention of scientific software in research papers. It is commonly accepted that proper citation of research datasets is important to the growing field of data science, because it provides a basic mechanism for linking datasets to other scientific entities. This linkage provides a fundamental infrastructure for other scientific tasks, including data sharing, data reuse, and reproducible research (e.g., Mooney & Newton, 2012). As data objects in their own right, scientific software environments, applications, and packages are also amenable to this infrastructure of citation, with similar benefits for researchers.

Earlier evidence suggests that data citation practices are highly inconsistent, largely because authors lack standards and policies to guide them in citing datasets (Belter, 2014; Mooney, 2011; Mooney & Newton, 2012). These findings have inspired the development of several data citation standards (e.g., Altman & King, 2007; King, 2007; Starr & Gastl, 2011) and the identification of key principles for citing datasets (Altman, Borgman, & Crosas, 2015; Altman & Crosas, 2013; Martone, 2014).

Parallel developments are underway in the study of software citation: researchers have provided some early analysis of software citation trends (e.g., Howison & Bullard, 2015; Li, Greenberg, & Lin, 2016; Pan, Yan, Wang, & Hua, 2015) and have proposed guiding principles for future work (Smith, Katz, & Niemeyer, 2016). However, several factors limit the quality of information obtainable from current software citation practices. For example, existing studies tend to describe software as a single unit in the scientific workflow; this does not always reflect the way in which the software is used.

Increasingly, scientific software is extensible. Extensibility is one of the most important principles of software design (Johansson & Löfgren, 2009). R, a software environment for statistical computing and data visualization, represents a successful example of this structure (Gentleman et al., 2004). As an open computational ecosystem, R allows for the expansion of its functionality through packages, which are defined as a “fundamental unit of sharable code” (Wickham, 2015, p. 3). The package, however, is not the smallest unit: often, only a few functions are invoked in the source code that interacts with the data. In principle, the more detail an author provides about the software entities used, the easier it is for others to reproduce the study. Much of this information, however, is not seen as important and is thus omitted from scientific papers (Kallet, 2004; Knorr, 1981; Knorr & Knorr, 1978).
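The three levels of granularity described here (core environment, package, function) can be made concrete with a small data model. The Python sketch below is illustrative only: the names `Level` and `SoftwareMention`, the fields, and the example values are hypothetical and are not taken from the paper's codebook.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    """Granularity of the software entity a paper refers to."""
    CORE = "core"          # the R environment itself
    PACKAGE = "package"    # an R package
    FUNCTION = "function"  # a single function inside a package


@dataclass(frozen=True)
class SoftwareMention:
    """One reference to R, a package, or a function (hypothetical schema)."""
    level: Level
    name: str
    package: Optional[str] = None  # parent package, for function-level mentions
    version: Optional[str] = None  # version string, if the paper reports one

    def label(self) -> str:
        """Return a readable identifier such as 'stats::lm' or 'R 3.2.1'."""
        if self.level is Level.FUNCTION and self.package:
            base = f"{self.package}::{self.name}"
        else:
            base = self.name
        return f"{base} {self.version}" if self.version else base


# Made-up examples, one per level.
mentions = [
    SoftwareMention(Level.CORE, "R", version="3.2.1"),
    SoftwareMention(Level.PACKAGE, "ggplot2"),
    SoftwareMention(Level.FUNCTION, "lm", package="stats"),
]
labels = [m.label() for m in mentions]
```

Recording the level explicitly keeps a function-level mention distinct from a mention of its parent package, which is the distinction a single-unit view of software loses.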

Because scientific software is distributed and deployed in such a granular fashion, a balanced approach to citation is needed to accommodate the increasing needs of reproducibility. Although our current study focuses on R specifically, it is our broader goal to understand this balance more generally, along with its implications for future citation standards. Our present investigation is framed by the following four questions:

  1. How is R cited or mentioned in research papers?

  2. How are the various packages of R cited or mentioned in research papers?

  3. How are R functions cited or mentioned in research papers?

  4. How have citation patterns for R and its packages changed over time?

In this analysis, 391 research papers—all of which mentioned R—were selected from all eight journals belonging to the Public Library of Science (PLoS) publishing project. Content analysis was then performed on the full texts of these papers.

The following section is a review of the literature on data citation, software citation, and empirical studies focusing on R. After this, we discuss our research methods, followed by our results, which are broadly grouped by the three levels of the R ecosystem: the core software, packages, and functions. We conclude by discussing the implications of these results and proposing some avenues for future study.

Literature review

This section reviews the literature concerning data citation and software citation. We also survey previous research that takes the R ecosystem as its subject.

Data collection

To pursue our research questions, we sought a set of papers that was both related to R and as comprehensive as possible. We chose PLoS as our data source for two principal reasons. First, PLoS' public API allows for easy retrieval of full-text data in XML format. Second, PLoS has a strong focus on biology, a knowledge domain in which R is popular among researchers (Gentleman et al., 2004).
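As a rough illustration of what paginated retrieval from a search API of this kind might look like, the Python sketch below only builds request URLs and makes no network call. The endpoint, the Solr-style parameter names (`q`, `fl`, `start`, `rows`, `wt`), the field list, and the query phrase are all assumptions to be checked against the current PLoS API documentation. None of them are taken from the paper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint of the PLoS Search API; verify before use.
PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_search_url(query: str, start: int = 0, rows: int = 100) -> str:
    """Build one page of a search request (parameter names are assumptions)."""
    params = {
        "q": query,                            # search expression
        "fl": "id,journal,publication_date",   # hypothetical field list
        "start": start,                        # offset of the first hit
        "rows": rows,                          # page size
        "wt": "json",                          # response format
    }
    return f"{PLOS_SEARCH_URL}?{urlencode(params)}"


# Hypothetical query phrase; the paper's actual search string is not shown here.
url = build_search_url('"R Core Team"', start=100)
parsed = parse_qs(urlsplit(url).query)
```

Separating URL construction from the HTTP call keeps the paging logic easy to inspect and test offline.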

The official citation instructions for R (Hornik, 2016) require that the following text string be

Results

This section is organized in top-down fashion, considering software-level, package-level, and function-level results in sequence.

Software mention and citation

Our results underscore the popularity of R in the knowledge domains represented by PLoS journals. Our data suggest that more than 4% of papers cite R following the official instructions, with about the same number of papers mentioning R unofficially. This would make R the second most popular software, based on the results provided by Pan et al. (2015).
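The distinction between citing R according to the official instructions and mentioning it unofficially can be sketched as a simple text classifier. This is not the authors' codebook. It assumes that a formal citation can be recognized by the title that R's own `citation()` output recommends, and the informal pattern and category labels are hypothetical.

```python
import re

# Title of the reference recommended by R's citation() output.
OFFICIAL_TITLE = "R: A language and environment for statistical computing"

# Hypothetical informal patterns, e.g. "R version 3.2.1" or "R software".
INFORMAL_PATTERN = re.compile(
    r"\bR\s+(?:version\s+\d+(?:\.\d+)*|software|statistical)"
)


def classify(reference_list: str, body_text: str) -> str:
    """Label a paper as a formal citation, an informal mention, or none."""
    # Collapse whitespace and ignore case so line breaks do not hide a match.
    normalized = " ".join(reference_list.split()).lower()
    if OFFICIAL_TITLE.lower() in normalized:
        return "formal citation"
    if INFORMAL_PATTERN.search(body_text):
        return "informal mention"
    return "none"


formal = classify(
    "R Core Team (2015). R: A Language and Environment for\n"
    "Statistical Computing. Vienna, Austria.",
    "",
)
informal = classify("", "Analyses were run in R version 3.2.1.")
neither = classify("", "No software is named here.")
```

A rule of this kind over-simplifies real papers, where R may be named in figure captions, supplementary files, or acknowledgements, which is one reason manual content analysis is still needed.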

Our results also suggest that the citation rate of R software is consistent with the results of previous studies. Howison and Bullard (2015)

Conclusion

This study explored how R, together with its packages and functions, is mentioned and cited in PLoS. A sample of 391 papers was drawn from the nearly 9000 papers published from 2005 to 2016 that cite or mention R in at least a somewhat official manner. Although scientific software has only recently come to be seen as a citable data object, a study of software citation practices sheds light on important trends, some of which are consistent with data citation practices in general. There are,

Author contributions

Kai Li: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.

Erjia Yan: Conceived and designed the analysis, Other contribution (revised the paper).

Yuanyuan Feng: Performed the analysis (participated in coding), Other contribution (revised the paper).

References (65)

  • X. Pan et al. Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics (2015)
  • D.S. Sayogo et al. Exploring the determinants of scientific data sharing: Understanding the motivation to publish research data. Government Information Quarterly (2013)
  • D.G. Altman. Practical statistics for medical research (1990)
  • M. Altman et al. The evolution of data citation: From principles to implementation. IASSIST Quarterly (2013)
  • M. Altman et al. A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine (2007)
  • M. Altman et al. An introduction to the joint principles for data citation. Bulletin of the Association for Information Science and Technology (2015)
  • American Psychological Association. Publication Manual of the American Psychological Association (2011)
  • C.W. Belter. Measuring the value of research data: A citation analysis of oceanographic data sets. PLoS One (2014)
  • C. Boettiger et al. Building software, building community: Lessons from the ROpenSci project. Journal of Open Research Software (2015)
  • C. Boettiger. Citing R packages [website] (2017)
  • G.C. Bowker et al. Toward information infrastructure studies: Ways of knowing in a networked environment. In International handbook of internet research (2009)
  • A. Burton et al.
  • S. Chamberlain et al. rplos: Interface to the Search 'API' for 'PLoS' Journals. R package version 0.5.6 (2016)
  • V. Chavan et al. The data paper: A mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics (2011)
  • D.S. Chawla. The unsung heroes of scientific software. Nature (2016)
  • M. Claes et al.
  • Contributed Packages (n.d.). Retrieved December 4, 2016, from...
  • Davis, P. (2016, January 6). PLOS ONE Shrinks by 11 Percent. Retrieved from...
  • A. Decan et al. On the development and distribution of R packages: An empirical analysis of the R ecosystem
  • S.A. Dodd. Bibliographic references for numeric social science data files: Suggested guidelines. Journal of the American Society for Information Science (1979)
  • A. Eklund et al. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences (2016)
  • A. Friedman. Statistics for Library and Information Services: A Primer for Using Open Source R Software for Accessibility and Visualization (2015)
  • R.C. Gentleman et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology (2004)
  • D.M. German et al. The evolution of the R software ecosystem
  • T. Green. We need publishing standards for datasets and data tables. Learned Publishing (2009)
  • A.J. Hey et al. (2009)
  • Hong, N. C., Hole, B., & Moore, S. (2013). Software papers: improving the reusability and sustainability of scientific...
  • K. Hornik. Citing R (2016)
  • J. Howison et al. Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (2015)
  • E.J. Huth. Structured abstracts for papers reporting clinical trials. Annals of Internal Medicine (1987)
  • D.C. Ince et al. The case for open computer programs. Nature (2012)
  • N. Johansson et al. Designing for Extensibility: An action research study of maximizing extensibility by means of design principles (B.S. thesis) (2009)