ABSTRACT
Background. Artifact evaluation was introduced into the software engineering and programming languages research community with a pilot at ESEC/FSE 2011 and has since seen healthy adoption across the conference landscape. Objective. In this qualitative study, we examine the community's expectations toward research artifacts and their evaluation processes. Method. We surveyed all members of artifact evaluation committees of major software engineering and programming languages conferences since the first pilot and compared their answers to the expectations set by calls for artifacts and reviewing guidelines. Results. While some expectations exceed those expressed in calls and reviewing guidelines, there is no consensus on quality thresholds for artifacts in general. We observe very specific quality expectations for particular artifact types, both for review and for later use, but find that calls rarely communicate them. We also find problematic inconsistencies in the terminology used to express the most important purpose of artifact evaluation: replicability. Conclusion. We derive several actionable suggestions that can help mature artifact evaluation in the inspected community and aid its introduction into other communities in computer science.