DOI: 10.1145/3643991.3644931

On the Executability of R Markdown Files

Published: 02 July 2024

Abstract

R Markdown files are literate programming documents that combine R code with its results and explanatory text. Such dynamic documents are designed to be easy to execute and to reproduce study results. However, little is known about whether R Markdown files actually execute, and failures can frustrate users who intend to reuse a document. This paper presents a large-scale study on the executability of R Markdown files collected from GitHub. Our results show that a majority of the files (64.95%) are not executable, even after our best efforts. To better understand the challenges, we group the exceptions raised while executing the documents into categories. Finally, we develop a classifier that predicts whether an R Markdown file is likely to be executable. Search engines could use such a classifier in their rankings to help developers find literate programming documents that serve as learning resources.
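To make the execution check concrete, here is a minimal sketch in R of how a single R Markdown file could be rendered and its failure recorded. This is an illustration, not the authors' actual harness: the helper name check_executability() is hypothetical, and the sketch assumes the rmarkdown package (and pandoc) are installed.

```r
# Attempt to render one R Markdown file and report whether it succeeds.
# check_executability() is a hypothetical helper for illustration only.
check_executability <- function(rmd_path) {
  tryCatch(
    {
      # Render into a temporary directory so output files do not
      # pollute the repository checkout being tested.
      rmarkdown::render(rmd_path, output_dir = tempdir(), quiet = TRUE)
      list(executable = TRUE, error = NA_character_)
    },
    error = function(e) {
      # Keep the exception message so failures can later be grouped
      # into categories (e.g., missing package, missing data file).
      list(executable = FALSE, error = conditionMessage(e))
    }
  )
}

# Example usage:
# check_executability("analysis/report.Rmd")
```

In a pipeline like the one the paper describes, the recorded error messages would be the raw material for categorizing exceptions and for labeling files as executable or not when training a classifier.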


Cited By

  • (2024) Improving the Comprehension of R Programs by Hybrid Dataflow Analysis. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE). DOI: 10.1145/3691620.3695603, pp. 2490-2493. Online publication date: 27-Oct-2024.



Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024
788 pages
ISBN:9798400705878
DOI:10.1145/3643991
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. R Markdown
  2. GitHub
  3. executability
  4. literate programming

Qualifiers

  • Research-article

Conference

MSR '24



Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 3

Reflects downloads up to 16 Jan 2025

