DOI: 10.1145/3643991.3644931

On the Executability of R Markdown Files

Published: 02 July 2024

Abstract

R Markdown files are literate programming documents that combine R code with its results and explanatory text. Such dynamic documents are designed to be easy to execute and to reproduce study results. However, little is known about whether R Markdown files actually execute, and failures can frustrate users who intend to reuse a document. This paper presents a large-scale study on the executability of R Markdown files collected from GitHub. Our results show that a majority of the files (64.95%) are not executable, even after our best efforts. To better understand the challenges, we group the exceptions raised while executing the documents into categories. Finally, we develop a classifier that predicts whether an R Markdown file is likely to be executable. Search engines could use such a classifier in their rankings to help developers find literate programming documents that serve as learning resources.
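To make the execution check concrete, here is a minimal sketch in R of how a single R Markdown file could be rendered and its failure recorded. This is an illustration, not the authors' actual harness: the helper name check_executability() is hypothetical, and the sketch assumes the rmarkdown package (and pandoc) are installed.

```r
# Attempt to render one R Markdown file and report whether it succeeds.
# check_executability() is a hypothetical helper for illustration only.
check_executability <- function(rmd_path) {
  tryCatch(
    {
      # Render into a temporary directory so output files do not
      # pollute the repository checkout being tested.
      rmarkdown::render(rmd_path, output_dir = tempdir(), quiet = TRUE)
      list(executable = TRUE, error = NA_character_)
    },
    error = function(e) {
      # Keep the exception message so failures can later be grouped
      # into categories (e.g., missing package, missing data file).
      list(executable = FALSE, error = conditionMessage(e))
    }
  )
}

# Example usage:
# check_executability("analysis/report.Rmd")
```

In a pipeline like the one the paper describes, the recorded error messages would be the raw material for categorizing exceptions and for labeling files as executable or not when training a classifier.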


Cited By

  • (2024) Improving the Comprehension of R Programs by Hybrid Dataflow Analysis. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE). DOI: 10.1145/3691620.3695603, pp. 2490-2493. Online publication date: 27-Oct-2024.



Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024
788 pages
ISBN:9798400705878
DOI:10.1145/3643991
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. R Markdown
  2. GitHub
  3. executability
  4. literate programming

Qualifiers

  • Research-article

Conference

MSR '24



Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 3

Reflects downloads up to 16 Jan 2025

