research-article

Seeing the Whole Elephant: Systematically Understanding and Uncovering Evaluation Biases in Automated Program Repair

Authors:

Xin Yi

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 3

Article No.: 65, Pages 1 - 37

https://doi.org/10.1145/3561382

Published: 27 April 2023

Abstract

Evaluation is the foundation of automated program repair (APR), as it provides empirical evidence on the strengths and weaknesses of APR techniques. However, the reliability of such evaluation is often threatened by various biases introduced into it. Consequently, bias exploration, which uncovers biases in APR evaluation, has become a pivotal activity and has been performed since the early years, when pioneering APR techniques were proposed. Unfortunately, there is still no methodology to support the systematic comprehension and discovery of evaluation biases in APR, which impedes the mitigation of such biases and threatens the evaluation of APR techniques.

In this work, we set out to systematically understand existing evaluation biases by rigorously conducting the first systematic literature review of known biases, and to systematically uncover new biases by building a taxonomy that categorizes evaluation biases. As a result, we identify 17 investigated biases and uncover a new bias in the usage of patch validation strategies. To validate this new bias, we devise and implement an executable framework, APRConfig, with which we evaluate three typical patch validation strategies using four representative heuristic-based and constraint-based APR techniques on three bug datasets. Overall, this article distills 13 findings for bias understanding, discovery, and validation. The systematic exploration we performed and the open source executable framework we propose in this article provide new insights as well as an infrastructure for future exploration and mitigation of biases in APR evaluation.

References

[1]

Rui Abreu, Peter Zoeteweij, and Arjan J. C. Van Gemund. 2006. An evaluation of similarity coefficients for software fault localization. In Proceedings of the 12th Pacific Rim International Symposium on Dependable Computing (PRDC’06). IEEE, 39–46.

Digital Library

[2]

Rui Abreu, Peter Zoeteweij, and Arjan J. C. Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Proceedings of Testing: Academic and Industrial Conference Practice and Research Techniques–MUTATION (TAICPART-MUTATION’07). IEEE, 89–98.

Digital Library

[3]

Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering (ICSE’11). IEEE, 1–10.

Digital Library

[4]

Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix: Learning to fix bugs automatically. Proc. ACM Program. Lang. 3 (2019), 159:1–159:27. DOI:

Digital Library

[5]

Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. 2014. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 306–317.

Digital Library

[6]

Rohan Bavishi, Hiroaki Yoshida, and Mukul R Prasad. 2019. Phoenix: Automated data-driven synthesis of repairs for static analysis violations. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 613–624.

Digital Library

[7]

Samuel Benton, Ali Ghanbari, and Lingming Zhang. 2019. Defexts: A curated dataset of reproducible real-world bugs for modern jvm languages. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion’19). IEEE, 47–50.

Digital Library

[8]

Zhiqiang Bian, Aymeric Blot, and Justyna Petke. 2021. Refining fitness functions for search-based program repair. In Proceedings of the IEEE/ACM International Workshop on Automated Program Repair (APR’21).

[9]

Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein, Vladimir Filkov, and Premkumar Devanbu. 2009. Fair and balanced? Bias in bug-fix datasets. In Proceedings of the 7th joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. 121–130.

Digital Library

[10]

Marcel Böhme and Abhik Roychoudhury. 2014. Corebench: Studying complexity of regression errors. In Proceedings of the International Symposium on Software Testing and Analysis. 105–115.

Digital Library

[11]

Liushan Chen, Yu Pei, and Carlo A Furia. 2017. Contract-based program repair without the contracts. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). IEEE, 637–647.

Digital Library

[12]

Liushan Chen, Yu Pei, and Carlo Alberto Furia. 2021. Contract-based program repair without the contracts: An extended study. IEEE Trans. Softw. Eng. 47, 12 (2021), 2841–2857. DOI:

[13]

Arnaud Chevallier. 2016. Strategic Thinking in Complex Problem Solving. Oxford University Press.

[14]

Maxime Cordy, Renaud Rwemalika, Mike Papadakis, and Mark Harman. 2019. Flakime: Laboratory-controlled test flakiness impact assessment. a case study on mutation testing and program repair. CoRR, abs/1912.03197 (2019). http://arxiv.org/abs/1912.03197.

[15]

Benoit Cornu, Thomas Durieux, Lionel Seinturier, and Martin Monperrus. 2015. Npefix: Automatic runtime repair of null pointer exceptions in java. CoRR, abs/1512.07423 (2015). http://arxiv.org/abs/1512.07423.

[16]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.

Digital Library

[17]

Heleno de S. Campos Junior, Marco Antônio P. Araújo, José Maria N. David, Regina Braga, Fernanda Campos, and Victor Ströele. 2017. Test case prioritization: A systematic review and mapping of the literature. In Proceedings of the 31st Brazilian Symposium on Software Engineering. 34–43.

Digital Library

[18]

Thomas Durieux, Benoit Cornu, Lionel Seinturier, and Martin Monperrus. 2017. Dynamic patch generation for null pointer exceptions using metaprogramming. In Proceedings of the IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER’17). IEEE, 349–358.

[19]

Thomas Durieux, Benjamin Danglot, Zhongxing Yu, Matias Martinez, Simon Urli, and Martin Monperrus. 2017. The patches of the nopol automatic repair system on the bugs of defects4j version 1.1. 0. Research Report. hal-01480084. Université Lille 1 - Sciences et Technologies.

[20]

Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. Empirical review of Java program repair tools: A large-scale experiment on 2,141 bugs and 23,551 repair attempts. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 302–313.

Digital Library

[21]

Thomas Durieux and Martin Monperrus. 2016. Dynamoth: Dynamic code synthesis for automatic program repair. In Proceedings of the 11th International Workshop on Automation of Software Test. 85–91.

Digital Library

[22]

Thomas Durieux and Martin Monperrus. 2016. IntroClassJava: A benchmark of 297 small and buggy Java programs. Research Report. hal-01272126. Universite Lille 1. https://hal.archives-ouvertes.fr/hal-01272126/file/main.pdf.

[23]

Marvin Fleischmann, Miglena Amirpur, Alexander Benlian, and Thomas Hess. 2014. Cognitive biases in information systems research: A scientometric analysis. In 22st European Conference on Information Systems, ECIS 2014, Tel Aviv, Israel, June 9-11, 2014, Michel Avital, Jan Marco Leimeister, and Ulrike Schultze (Eds.). http://aisel.aisnet.org/ecis2014/proceedings/track02/5.

[24]

Andrew Forward and Timothy C. Lethbridge. 2008. A taxonomy of software types to facilitate search and evidence-based software engineering. In Proceedings of the Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds. 179–191.

Digital Library

[25]

Paul Galdas. 2017. Revisiting bias in qualitative research: Reflections on its relationship with funding and impact. International Journal of Qualitative Methods 16, 1 (2017), 1–2.

[26]

Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2017. Automatic software repair: A survey. IEEE Trans. Softw. Eng. 45, 1 (2017), 34–67.

Digital Library

[27]

Ali Ghanbari, Samuel Benton, and Lingming Zhang. 2019. Practical program repair via bytecode mutation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 19–30.

Digital Library

[28]

Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65.

Digital Library

[29]

He, Ye, Matias, Martinez, Thomas, Durieux, Martin, and Monperrus. 2019. A comprehensive study of automatic program repair on the QuixBugs benchmark. In Proceedings of the IEEE 1st International Workshop on Intelligent Bug Fixing (IBF’19).

[30]

Y. He, M. Martinez, T. Durieux, and M. Monperrus. 2021. A comprehensive study of automatic program repair on the QuixBugs benchmark. J. Syst. Softw. 171 (2021), 110825.

[31]

Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering. 12–23.

Digital Library

[32]

Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1110–1121.

Digital Library

[33]

Jiajun Jiang, Luyao Ren, Yingfei Xiong, and Lingming Zhang. 2019. Inferring program transformations from singular examples via big code. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19). IEEE, 255–266.

Digital Library

[34]

Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 298–309.

Digital Library

[35]

Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-aware neural machine translation for automatic program repair. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 1161–1173.

Digital Library

[36]

René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the International Symposium on Software Testing and Analysis. 437–440.

Digital Library

[37]

Maria Kechagia, Sergey Mechtaev, Federica Sarro, and Mark Harman. 2022. Evaluating automatic program repair capabilities to repair API misuses. IEEE Trans. Softw. Eng. 48, 7 (2022), 2658–2679. DOI:

[38]

Staffs Keele et al. 2007. Guidelines for Performing Systematic Literature Reviews in Software Engineering. Technical Report. Citeseer.

[39]

Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In Proceedings of the 35th International Conference on Software Engineering (ICSE’13). IEEE, 802–811.

[40]

Jindae Kim and Sunghun Kim. 2019. Automatic patch generation with context-based change application. Emp. Softw. Eng. 24, 6 (2019), 4071–4106.

[41]

Barbara Kitchenham, O. Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman. 2009. Systematic literature reviews in software engineering–a systematic literature review. Inf. Softw. Technol. 51, 1 (2009), 7–15.

Digital Library

[42]

Barbara Kitchenham, Rialette Pretorius, David Budgen, O. Pearl Brereton, Mark Turner, Mahmood Niazi, and Stephen Linkman. 2010. Systematic literature reviews in software engineering—A tertiary study. Inf. Softw. Technol. 52, 8 (2010), 792–805.

Digital Library

[43]

Pavneet Singh Kochhar, Yuan Tian, and David Lo. 2014. Potential biases in bug localization: Do they matter? In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. 803–814.

Digital Library

[44]

Pingfan Kong, Li Li, Jun Gao, Kui Liu, Tegawendé F. Bissyandé, and Jacques Klein. 2018. Automated testing of android apps: A systematic literature review. IEEE Trans. Reliabil. 68, 1 (2018), 45–66.

[45]

Xianglong Kong, Lingming Zhang, W. Eric Wong, and Bixin Li. 2015. Experience report: How do techniques, programs, and tests impact automated program repair? In Proceedings of the IEEE 26th International Symposium on Software Reliability Engineering (ISSRE’15). IEEE, 194–204.

Digital Library

[46]

Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon. 2020. Fixminer: Mining relevant fix patterns for automated program repair. Emp. Softw. Eng. 25, 3 (2020), 1980–2024. DOI:

Digital Library

[47]

Barbara H. Kwasnik. 1999. The role of classification in knowledge representation and discovery. Libr. Trends 48, 1 (1999). http://alexia.lis.uiuc.edu/puboff/catalog/trends/48_1abs.html#kwasnik.

[48]

Ryan Lawler. 2012. How do you hire great engineers? Give them a challenge. https://gigaom.com/2012/01/19/quixey-challenge/.

[49]

Dinh Xuan Bach Le, Lingfeng Bao, David Lo, Xin Xia, Shanping Li, and Corina Pasareanu. 2019. On reliability of patch correctness assessment. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 524–535.

Digital Library

[50]

Xuan Bach D. Le, David Lo, and Claire Le Goues. 2016. History driven program repair. In Proceedings of the IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER’16), Vol. 1. IEEE, 213–224.

[51]

Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Proceedings of the 34th International Conference on Software Engineering (ICSE’12). IEEE, 3–13.

[52]

Claire Le Goues, Neal Holtschulte, Edward K. Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. 2015. The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Trans. Softw. Eng. 41, 12 (2015), 1236–1256.

Digital Library

[53]

Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2011. Genprog: A generic method for automatic software repair. IEEE Trans. Softw. Eng. 38, 1 (2011), 54–72.

Digital Library

[54]

Li Li, Tegawendé F. Bissyandé, Mike Papadakis, Siegfried Rasthofer, Alexandre Bartel, Damien Octeau, Jacques Klein, and Le Traon. 2017. Static analysis of android apps: A systematic literature review. Inf. Softw. Technol. 88 (2017), 67–95.

Digital Library

[55]

Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. DLFix: Context-based code transformation learning for automated program repair. In Proceedings of the ACM/IEEE 42th International Conference on Software Engineering. IEEE, 602–614.

Digital Library

[56]

Bo Lin, Shangwen Wang, Ming Wen, and Xiaoguang Mao. 2022. Context-aware code change embedding for better patch correctness assessment. ACM Trans. Softw. Eng. Methodol. 31, 3 (2022), 1–29.

Digital Library

[57]

Bo Lin, Shangwen Wang, Ming Wen, Zhang Zhang, Hongjun Wu, Yihao Qin, and Xiaoguang Mao. 2020. Understanding the non-repairability factors of automated program repair techniques. In Proceedings of the 27th Asia-Pacific Software Engineering Conference.

[58]

Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings Companion of the ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. 55–56.

Digital Library

[59]

Kui Liu, Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, and Yves Le Traon. 2019. You cannot fix what you cannot find! An investigation of fault localization bias in benchmarking automated program repair systems. In Proceedings of the 12th IEEE Conference on Software Testing, Validation and Verification (ICST’19). IEEE, 102–113.

[60]

Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. Avatar: Fixing semantic bugs with fix patterns of static analysis violations. In Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER’19). IEEE, 1–12.

[61]

Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. TBar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 31–42.

Digital Library

[62]

Kui Liu, Anil Koyuncu, Kisub Kim, Dongsun Kim, and Tegawendé F Bissyandé. 2018. LSRepair: Live search of fix ingredients for automated program repair. In Proceedings of the 25th Asia-Pacific Software Engineering Conference (APSEC’18). IEEE, 658–662.

[63]

Kui Liu, Li Li, Anil Koyuncu, Dongsun Kim, Zhe Liu, Jacques Klein, and Tegawendé F. Bissyandé. 2021. A critical review on the evaluation of automated program repair systems. J. Syst. Softw. 171 (2021), 110817.

[64]

Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé François D. Assise Bissyande, Dongsun Kim, Peng Wu, Jacques Klein, Xiaoguang Mao, and Yves Le Traon. 2020. On the efficiency of test suite based program repair: A systematic assessment of 16 automated repair systems for Java programs. In Proceedings of the 42nd ACM/IEEE International Conference on Software Engineering (ICSE’20).

Digital Library

[65]

Xuliang Liu and Hao Zhong. 2018. Mining stackoverflow for program repair. In Proceedings of the IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER’18). IEEE, 118–129.

[66]

Yepang Liu, Jue Wang, Lili Wei, Chang Xu, Shing-Chi Cheung, Tianyong Wu, Jun Yan, and Jian Zhang. 2019. DroidLeaks: A comprehensive database of resource leaks in Android apps. Emp. Softw. Eng. 24, 6 (2019), 3435–3483.

[67]

Giuliano Lorenzoni, Paulo Alencar, Nathalia Nascimento, and Donald Cowan. 2021. Machine learning model development from a software engineering perspective: A systematic literature review. CoRR, abs/2102.07574 (2021). https://arxiv.org/abs/2102.07574.

[68]

Yiling Lou, Samuel Benton, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. How does regression test selection affect program repair? An extensive study on 2 million patches. CoRR, abs/2105.07311 (2021). https://arxiv.org/abs/2105.07311.

[69]

Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. CoCoNuT: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 101–114.

Digital Library

[70]

Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An extensible Java bug benchmark for automatic program repair studies. In Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER’19). IEEE, 468–478.

[71]

Amirabbas Majd, Mojtaba Vahidi-Asl, Alireza Khalilian, Ahmad Baraani-Dastjerdi, and Bahman Zamani. 2019. Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques. J. Comput. Lang. 53 (2019), 38–52.

[72]

Alexandru Marginean, Johannes Bader, Satish Chandra, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, and Andrew Scott. 2019. Sapfix: Automated end-to-end repair at scale. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP’19). IEEE, 269–278.

Digital Library

[73]

Matias Martinez and Martin Monperrus. 2015. Mining software repair models for reasoning on the search space of automated program fixing. Emp. Softw. Eng. 20, 1 (2015), 176–205.

Digital Library

[74]

Matias Martinez and Martin Monperrus. 2016. Astor: A program repair library for java. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 441–444.

Digital Library

[75]

Matias Martinez and Martin Monperrus. 2018. Ultra-large repair search space with automatically mined templates: The cardumen mode of astor. In International Symposium on Search Based Software Engineering. Springer, 65–86.

[76]

Barbara Minto. 2009. The Pyramid Principle: Logic in Writing and Thinking. Pearson Education.

[77]

Rahul Mohanani, Iflaah Salman, Burak Turhan, Pilar Rodríguez, and Paul Ralph. 2018. Cognitive biases in software engineering: A systematic mapping study. IEEE Trans. Softw. Eng. 46, 12 (2018), 1318–1339.

[78]

Martin Monperrus. 2014. A critical review of “automatic patch generation learned from human-written patches”: Essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the 36th International Conference on Software Engineering. 234–242.

Digital Library

[79]

Martin Monperrus. 2018. Automatic software repair: A bibliography. ACM Comput. Surv. 51, 1 (2018), 1–24.

Digital Library

[80]

Martin Monperrus. 2020. The living review on automated program repair. Technical Report. hal-01956501. HAL Archives Ouvertes. https://hal.archives-ouvertes.fr/hal-01956501v4/file/repair-living-review.pdf.

[81]

Martin Monperrus, Simon Urli, Thomas Durieux, Matias Martinez, Benoit Baudry, and Lionel Seinturier. 2019. Repairnator patches programs automatically. Ubiquity 2019(July2019), 1–12.

Digital Library

[82]

Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun. 2018. Do automated program repair techniques repair hard and important bugs? Emp. Softw. Eng. 23, 5 (2018), 2901–2947.

Digital Library

[83]

Institute of Electrical and Electronics Engineers. 1987. IEEE Standard Taxonomy for Software Engineering Standards.

[84]

Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D. Ernst, Deric Pang, and Benjamin Keller. 2017. Evaluating and improving fault localization. In Proceedings of the IEEE/ACM 39th International Conference on Software Engineering (ICSE’17). IEEE, 609–620.

Digital Library

[85]

Yuhua Qi, Wenhong Liu, Weixiang Zhang, and Deheng Yang. 2018. How to measure the performance of automated program repair. In Proceedings of the 5th International Conference on Information Science and Control Engineering (ICISCE’18). IEEE, 246–250.

[86]

Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering. 254–265.

Digital Library

[87]

Yuhua Qi, Xiaoguang Mao, Yan Lei, and Chengsong Wang. 2013. Using automated program repair for evaluating the effectiveness of fault localization techniques. In Proceedings of the International Symposium on Software Testing and Analysis. 191–201.

Digital Library

[88]

Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the International Symposium on Software Testing and Analysis. 24–36.

Digital Library

[89]

Yihao Qin, Shangwen Wang, Kui Liu, Xiaoguang Mao, and Tegawendé F. Bissyandé. 2021. On the impact of flaky tests in automated program repair. In Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER’21). IEEE, 295–306.

[90]

Abhik Roychoudhury and Yingfei Xiong. 2019. Automated program repair: A step towards software automation. Sci. Chin. Inf. Sci. 62, 10 (2019), 200103.

[91]

Ripon K Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R Prasad. 2018. Bugs.jar: A large-scale, diverse dataset of real-world java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories. 10–13.

Digital Library

[92]

Ripon K. Saha, Hiroaki Yoshida, Mukul R. Prasad, Susumu Tokumoto, Kuniharu Takayama, and Isao Nanba. 2018. Elixir: An automated repair tool for Java programs. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings. 77–80.

Digital Library

[93]

Seemanta Saha et al. 2019. Harnessing evolution for multi-hunk program repair. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 13–24.

Digital Library

[94]

Yusra Shakeel, Jacob Krüger, Ivonne Von Nostitz-Wallwitz, Gunter Saake, and Thomas Leich. 2019. Automated selection and quality assessment of primary studies: A systematic literature review. J. Data Inf. Qual. 12, 1 (2019), 1–26.

[95]

André Silva, Matias Martinez, Benjamin Danglot, Davide Ginelli, and Martin Monperrus. 2021. FLACOCO: Fault localization for Java based on industry-grade coverage. CoRR, abs/2111.12513 (2021). https://arxiv.org/abs/2111.12513.

[96]

Darja Šmite, Claes Wohlin, Zane Galviņa, and Rafael Prikladnicki. 2014. An empirically based terminology and taxonomy for global software engineering. Emp. Softw. Eng. 19, 1 (2014), 105–153.

Digital Library

[97]

Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the cure worse than the disease? overfitting in automated program repair. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering. 532–543.

Digital Library

[98]

Joanna Smith and Helen Noble. 2014. Bias in research. Evid.-bas. Nurs. 17, 4 (2014), 100–101.

[99]

Webb Stacy and Jean MacMillan. 1995. Cognitive bias in software engineering. Commun. ACM 38, 6 (1995), 57–63.

Digital Library

[100]

Shin Hwei Tan, Jooyong Yi, Sergey Mechtaev, Abhik Roychoudhury, et al. 2017. Codeflaws: A programming competition benchmark for evaluating automated program repair tools. In Proceedings of the IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C’17). IEEE, 180–182.

[101]

Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu. 2014. Automatically generated patches as debugging aids: A human study. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 64–74.

Digital Library

[102]

Michael Unterkalmsteiner, Robert Feldt, and Tony Gorschek. 2014. A taxonomy for requirements engineering and software test alignment. ACM Trans. Softw. Eng. Methodol. 23, 2 (2014), 1–38.

Digital Library

[103]

Muhammad Usman, Ricardo Britto, Jürgen Börstler, and Emilia Mendes. 2017. Taxonomies in software engineering: A systematic mapping study and a revised taxonomy development method. Inf. Softw. Technol. 85 (2017), 43–59.

Digital Library

[104]

András Vargha and Harold D. Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J. Educ. Behav. Stat. 25, 2 (2000), 101–132.

[105]

Shangwen Wang, Ming Wen, Bo Lin, Hongjun Wu, Yihao Qin, Deqing Zou, Xiaoguang Mao, and Hai Jin. 2020. Automated patch correctness assessment: How far are we? In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 968–980.

Digital Library

[106]

Shangwen Wang, Ming Wen, Xiaoguang Mao, and Deheng Yang. 2019. Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques. In Proceedings of the Evaluation and Assessment on Software Engineering. 260–266.

Digital Library

[107]

Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In Proceedings of the IEEE 31st International Conference on Software Engineering. IEEE, 364–374.

Digital Library

[108]

Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-aware patch generation for better automated program repair. In Proceedings of the IEEE/ACM 40th International Conference on Software Engineering (ICSE’18). IEEE, 1–11.

Digital Library

[109]

George R. Wheaton. 1968. Development of a taxonomy of human performance: A review of classificatory systems relating to tasks and performance. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.973.125&rep=rep1&type=pdf.

[110]

Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.

[111]

Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer Science & Business Media.

[112]

W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization. IEEE Trans. Softw. Eng. 42, 8 (2016), 707–740.

Digital Library

[113]

Qi Xin and Steven P. Reiss. 2017. Leveraging syntax-related code for automated program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). IEEE, 660–670.

Digital Library

[114]

Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the IEEE/ACM 39th International Conference on Software Engineering (ICSE’17). IEEE, 416–426.

Digital Library

[115]

Tongtong Xu, Liushan Chen, Yu Pei, Tian Zhang, Minxue Pan, and Carlo Alberto Furia. 2022. Restore: Retrospective fault localization enhancing automated program repair. IEEE Trans. Softw. Eng. 48, 2 (2022), 309–326.

Digital Library

[116]

Xuezheng Xu, Yulei Sui, Hua Yan, and Jingling Xue. 2019. VFix: Value-flow-guided precise program repair for null pointer dereferences. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 512–523.

Digital Library

[117]

Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian Lamelas Marcote, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2016. Nopol: Automatic repair of conditional statement bugs in java programs. IEEE Trans. Softw. Eng. 43, 1 (2016), 34–55.

Digital Library

[118]

Meng Yan, Xin Xia, Yuanrui Fan, Ahmed E. Hassan, David Lo, and Shanping Li. 2022. Just-in-time defect identification and localization: A two-phase framework. IEEE Trans. Softw. Eng. 48, 1 (2022), 82–101. DOI:

Digital Library

[119]

Deheng Yang. 2022. Artifact Page of Our Study. Retrieved from https://github.com/DehengYang/APRConfig, 2021.

[120]

Deheng Yang. 2022. An Extended Description of the 17 Known Biases. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/doc/RQ1.3_bias_mitigation/detailed_explanation_of_the_17_known_biases.md.

[121]

Deheng Yang. 2022. The Guideline on How to Extend APRConfig. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/How_to_extend.md.

[122]

Deheng Yang. 2022. The Results of Our Investigation on Known Bias Mitigation. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/doc/RQ1.3_bias_mitigation/results_of_investigation_on_known_bias_mitigation.md.

[123]

Deheng Yang. 2022. The Results of Quality Assessment. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/doc/SLR_results/results_of_quality_assessment.md.

[124]

Deheng Yang, Yan Lei, Xiaoguang Mao, David Lo, Huan Xie, and Meng Yan. 2021. Is the ground truth really accurate? Dataset purification for automated program repair. In Proceedings of the IEEE 28th International Conference on Software Analysis, Evolution and Reengineering (SANER’21). IEEE.

[125]

He Ye, Matias Martinez, and Martin Monperrus. 2021. Automated patch assessment for program repair at scale. Emp. Softw. Eng. 26, 2 (2021), 1–38.

Digital Library

[126]

He Ye, Matias Martinez, and Martin Monperrus. 2021. Neural program repair with execution-based backpropagation. CoRR, abs/2105.04123 (2021). https://arxiv.org/abs/2105.04123.

[127]

He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural program repair with execution-based backpropagation. In Proceedings of the IEEE/ACM 44th International Conference on Software Engineering (ICSE’22). IEEE, 1506–1518.

[128]

Yuan Yuan and Wolfgang Banzhaf. 2020. ARJA: Automated repair of java programs via multi-objective genetic programming. IEEE Trans. Softw. Eng. 46, 10 (2020), 1040–1067.

[129]

Yuan Yuan and Wolfgang Banzhaf. 2020. Toward better evolutionary program repair: An integrated approach. ACM Trans. Softw. Eng. Methodol. 29, 1 (2020), 1–53.

[130]

Jie M. Zhang and Mark Harman. 2021. “Ignorance and prejudice” in software fairness. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 1436–1447.

Cited By

Csuvik V, Horváth D, Lajkó M, Vidács L. (2025). GenProgJS: A Baseline System for Test-Based Automated Repair of JavaScript Programs. IEEE Transactions on Software Engineering 51:2 (325-343). DOI: 10.1109/TSE.2024.3497798. Online publication date: Feb-2025.
https://doi.org/10.1109/TSE.2024.3497798
Huang K, Xu Z, Yang S, Sun H, Li X, Yan Z, Zhang Y. (2024). Evolving Paradigms in Automated Program Repair: Taxonomy, Challenges, and Opportunities. ACM Computing Surveys 57:2 (1-43). DOI: 10.1145/3696450. Online publication date: 10-Oct-2024.
https://dl.acm.org/doi/10.1145/3696450
Yang Z, Liu F, Yu Z, Keung J, Li J, Liu S, Hong Y, Ma X, Jin Z, Li G. (2024). Exploring and Unleashing the Power of Large Language Models in Automated Code Translation. Proceedings of the ACM on Software Engineering 1:FSE (1585-1608). DOI: 10.1145/3660778. Online publication date: 12-Jul-2024.
https://dl.acm.org/doi/10.1145/3660778

Index Terms

Seeing the Whole Elephant: Systematically Understanding and Uncovering Evaluation Biases in Automated Program Repair
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Automatic programming
    2. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Information & Contributors

Information

Published In

ACM Transactions on Software Engineering and Methodology Volume 32, Issue 3

May 2023

937 pages

ISSN:1049-331X

EISSN:1557-7392

DOI:10.1145/3594533

Editor:
Mauro Pezzè
USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 April 2023

Online AM: 04 September 2022

Accepted: 26 August 2022

Revised: 19 August 2022

Received: 05 December 2021

Published in TOSEM Volume 32, Issue 3

Author Tags

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Major Key Project of PCL

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 6
Total Downloads: 605
Downloads (Last 12 months): 121
Downloads (Last 6 weeks): 9

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Xie H, Lei Y, Yan M, Li S, Mao X, Yu Y, Lo D. (2024). Towards More Precise Coincidental Correctness Detection With Deep Semantic Learning. IEEE Transactions on Software Engineering 50:12 (3265-3289). DOI: 10.1109/TSE.2024.3481893. Online publication date: 1-Dec-2024.
https://dl.acm.org/doi/10.1109/TSE.2024.3481893
Yu X, Rao J, Liu L, Lin G, Hu W, Keung J, Zhou J, Xiang J. (2024). Improving effort-aware defect prediction by directly learning to rank software modules. Information and Software Technology 165:C. DOI: 10.1016/j.infsof.2023.107250. Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1016/j.infsof.2023.107250
Li F, Yang P, Keung J, Hu W, Luo H, Yu X. (2023). Revisiting ‘revisiting supervised methods for effort‐aware cross‐project defect prediction’. IET Software 17:4 (472-495). DOI: 10.1049/sfw2.12133. Online publication date: 27-Jun-2023.
https://dl.acm.org/doi/10.1049/sfw2.12133
