Regular article
How is R cited in research outputs? Structure, impacts, and citation standard
Introduction
This paper concerns the citation and mention of scientific software in research papers. It is commonly accepted that proper citation of research datasets is important to the growing field of data science, because it provides a basic mechanism for linking datasets to other scientific entities. This linkage provides a fundamental infrastructure for other scientific tasks, including data sharing, data reuse, and reproducible research (e.g., Mooney & Newton, 2012). As data objects in their own right, scientific software environments, applications, and packages are also amenable to this infrastructure of citation, with similar benefits for researchers.
Earlier evidence suggests that data citation practices are highly inconsistent, largely because authors lack standards and policies to guide them in citing datasets (Belter, 2014; Mooney, 2011; Mooney & Newton, 2012). These findings have inspired the development of several data citation standards (e.g., Altman & King, 2007; King, 2007; Starr & Gastl, 2011) and the identification of key principles for citing datasets (Altman, Borgman, & Crosas, 2015; Altman & Crosas, 2013; Martone, 2014).
Parallel developments are underway in the study of software citation: researchers have provided some early analysis of software citation trends (e.g., Howison & Bullard, 2015; Li, Greenberg, & Lin, 2016; Pan, Yan, Wang, & Hua, 2015) and have proposed guiding principles for future work (Smith, Katz, & Niemeyer, 2016). However, several factors limit the quality of information obtainable from current software citation practices. For example, existing studies tend to describe software as a single unit in the scientific workflow; this does not always reflect the way in which the software is used.
Increasingly, scientific software is extensible. Extensibility is one of the most important principles of software design (Johansson & Löfgren, 2009). R, a statistical and data visualization software and environment, represents a successful example of this structure (Gentleman et al., 2004). As an open computational ecosystem, R allows for the expansion of its functionality through packages, which are defined as a “fundamental unit of sharable code” (Wickham, 2015, p. 3). The package, however, is not the smallest unit: often, only a few functions are invoked in a piece of source code which interacts with the data. In principle, the more detailed information an author includes about the use of software entities, the easier it is for others to reproduce the study. Much of this information, however, is not seen as important and is thus omitted from scientific papers (Kallet, 2004; Knorr, 1981; Knorr & Knorr, 1978).
Because scientific software is distributed and deployed in such a granular fashion, a balanced approach to citation is needed to accommodate the increasing needs of reproducibility. Although our current study focuses on R specifically, it is our broader goal to understand this balance more generally, along with its implications for future citation standards. Our present investigation is framed by the following four questions:
1. How is R cited or mentioned in research papers?
2. How are the various packages of R cited or mentioned in research papers?
3. How are R functions cited or mentioned in research papers?
4. How have citation patterns for R and its packages changed over time?
In this analysis, 391 research papers—all of which mentioned R—were selected from all eight journals belonging to the Public Library of Science (PLoS) publishing project. Content analysis was then performed on the full texts of these papers.
The following section is a review of the literature on data citation, software citation, and empirical studies focusing on R. After this, we discuss our research methods, followed by our results, which are broadly grouped by the three levels of the R ecosystem: the core software, packages, and functions. We conclude by discussing the implications of these results and proposing some avenues for future study.
Literature review
This section reviews the literature concerning data citation and software citation. We also survey previous research that takes the R ecosystem as its subject.
Data collection
In order to pursue our research questions, we sought a set of papers that was both related to R and as comprehensive as possible. We chose PLoS as our data source for two principal reasons: first, PLoS' public API allows for easy retrieval of full-text data in XML format. Second, PLoS has a strong focus on biology, a knowledge domain in which R is popular among researchers (Gentleman et al., 2004).
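As a rough illustration of this kind of retrieval, the following Python sketch builds a full-text search URL and pulls paragraphs that mention R out of an article's XML. It is not the procedure used in this study. The endpoint URL, the query field names (`everything`, `id`, `journal`, `publication_date`), the example phrase, and the tiny sample document are all assumptions made for illustration and should be checked against the current PLoS API documentation.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint of the PLoS search API; verify against the official docs.
SEARCH_ENDPOINT = "https://api.plos.org/search"


def build_search_url(phrase, rows=100, start=0):
    """Build a search URL for articles whose text contains `phrase`."""
    params = {
        "q": f'everything:"{phrase}"',        # assumed full-text field name
        "fl": "id,journal,publication_date",  # assumed names of returned fields
        "wt": "json",
        "rows": rows,
        "start": start,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def paragraphs_mentioning_r(xml_text):
    """Return the text of every <p> element that contains a standalone 'R'."""
    root = ET.fromstring(xml_text)
    hits = []
    for p in root.iter("p"):
        # itertext() flattens nested elements such as in-text references.
        text = "".join(p.itertext()).strip()
        if re.search(r"\bR\b", text):
            hits.append(text)
    return hits


# Hypothetical, heavily simplified article XML used only for this example.
SAMPLE = """<article><body><sec><title>Methods</title>
<p>All analyses were run in R <xref>[12]</xref>.</p>
<p>Samples were stored at 4 degrees.</p>
</sec></body></article>"""

if __name__ == "__main__":
    print(build_search_url("R package", rows=5))
    print(paragraphs_mentioning_r(SAMPLE))
```

A standalone-letter match such as `\bR\b` will also catch unrelated uses of the letter (for example, variable names or initials), so candidate paragraphs found this way would still need manual review.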
The official citation instructions for R (Hornik, 2016) require that the following text string be
Results
This section is organized in top-down fashion, considering software-level, package-level, and function-level results in sequence.
Software mention and citation
Our results underscore the popularity of R in the knowledge domains represented by PLoS journals. Our data suggest that more than 4% of papers cite R following the official instructions, with about the same number of papers mentioning R unofficially. This would make R the second most popular software, based on the results provided by Pan et al. (2015).
Our results also suggest that the citation rate of R software is consistent with the results of previous studies. Howison and Bullard (2015)
Conclusion
This study explored how R, together with its packages and functions, is mentioned and cited in PLoS. A sample of 391 papers was drawn from the nearly 9000 papers published from 2005 to 2016 that cite or mention R in at least a somewhat official manner. Although scientific software has only recently come to be seen as a citable data object, a study of software citation practices sheds light on important trends, some of which are consistent with data citation practices in general. There are,
Author contributions
Kai Li: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.
Erjia Yan: Conceived and designed the analysis, Other contribution (revised the paper).
Yuanyuan Feng: Performed the analysis (participated in coding), Other contribution (revised the paper).
References (65)
- et al. Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics (2015)
- et al. Exploring the determinants of scientific data sharing: Understanding the motivation to publish research data. Government Information Quarterly (2013)
- Practical statistics for medical research (1990)
- et al. The evolution of data citation: From principles to implementation. IAssist Quarterly (2013)
- et al. A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine (2007)
- et al. An introduction to the joint principles for data citation. Bulletin of the Association for Information Science and Technology (2015)
- Publication Manual of the American Psychological Association (2011)
- Measuring the value of research data: A citation analysis of oceanographic data sets. PLoS One (2014)
- et al. Building software, building community: Lessons from the ROpenSci project. Journal of Open Research Software (2015)
- Citing R packages [website] (2017)
- Toward information infrastructure studies: Ways of knowing in a networked environment. In International handbook of internet research
- rplos: Interface to the Search 'API' for 'PLoS' Journals. R package version 0.5.6
- The data paper: A mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics
- The unsung heroes of scientific software. Nature
- On the development and distribution of R packages: An empirical analysis of the R ecosystem
- Bibliographic references for numeric social science data files: Suggested guidelines. Journal of the American Society for Information Science
- Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences
- Statistics for Library and Information Services: A Primer for Using Open Source R Software for Accessibility and Visualization
- Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology
- The evolution of the R software ecosystem
- We need publishing standards for datasets and data tables. Learned Publishing
- Citing R
- Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology
- Structured abstracts for papers reporting clinical trials. Annals of Internal Medicine
- The case for open computer programs. Nature
- Designing for Extensibility: An action research study of maximizing extensibility by means of design principles (B.S. thesis)