
Seeing the Whole Elephant: Systematically Understanding and Uncovering Evaluation Biases in Automated Program Repair

Published: 27 April 2023

Abstract

Evaluation is the foundation of automated program repair (APR), as it provides empirical evidence on the strengths and weaknesses of APR techniques. However, the reliability of such evaluation is often threatened by various biases introduced along the way. Consequently, bias exploration, which uncovers biases in APR evaluation, has become a pivotal activity and has been performed since the early years, when the pioneering APR techniques were proposed. Unfortunately, there is still no methodology that supports the systematic comprehension and discovery of evaluation biases in APR, which impedes the mitigation of such biases and threatens the evaluation of APR techniques.

In this work, we systematically understand existing evaluation biases by conducting the first systematic literature review of known biases, and we systematically uncover new biases by building a taxonomy that categorizes evaluation biases. As a result, we identify 17 investigated biases and uncover a new bias in the usage of patch validation strategies. To validate this new bias, we devise and implement an executable framework, APRConfig, with which we evaluate three typical patch validation strategies using four representative heuristic-based and constraint-based APR techniques on three bug datasets. Overall, this article distills 13 findings for bias understanding, discovery, and validation. The systematic exploration we performed and the open-source executable framework we propose in this article provide new insights as well as an infrastructure for future exploration and mitigation of biases in APR evaluation.

REFERENCES

  1. [1] Abreu Rui, Zoeteweij Peter, and Gemund Arjan J. C. Van. 2006. An evaluation of similarity coefficients for software fault localization. In Proceedings of the 12th Pacific Rim International Symposium on Dependable Computing (PRDC’06). IEEE, 3946.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. [2] Abreu Rui, Zoeteweij Peter, and Gemund Arjan J. C. Van. 2007. On the accuracy of spectrum-based fault localization. In Proceedings of Testing: Academic and Industrial Conference Practice and Research Techniques–MUTATION (TAICPART-MUTATION’07). IEEE, 8998.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. [3] Arcuri Andrea and Briand Lionel. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering (ICSE’11). IEEE, 110.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Bader Johannes, Scott Andrew, Pradel Michael, and Chandra Satish. 2019. Getafix: Learning to fix bugs automatically. Proc. ACM Program. Lang. 3 (2019), 159:1–159:27. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. [5] Barr Earl T., Brun Yuriy, Devanbu Premkumar, Harman Mark, and Sarro Federica. 2014. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 306317.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. [6] Bavishi Rohan, Yoshida Hiroaki, and Prasad Mukul R. 2019. Phoenix: Automated data-driven synthesis of repairs for static analysis violations. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 613624.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. [7] Benton Samuel, Ghanbari Ali, and Zhang Lingming. 2019. Defexts: A curated dataset of reproducible real-world bugs for modern jvm languages. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion’19). IEEE, 4750.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] Bian Zhiqiang, Blot Aymeric, and Petke Justyna. 2021. Refining fitness functions for search-based program repair. In Proceedings of the IEEE/ACM International Workshop on Automated Program Repair (APR’21).Google ScholarGoogle Scholar
  9. [9] Bird Christian, Bachmann Adrian, Aune Eirik, Duffy John, Bernstein Abraham, Filkov Vladimir, and Devanbu Premkumar. 2009. Fair and balanced? Bias in bug-fix datasets. In Proceedings of the 7th joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. 121130.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. [10] Böhme Marcel and Roychoudhury Abhik. 2014. Corebench: Studying complexity of regression errors. In Proceedings of the International Symposium on Software Testing and Analysis. 105115.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Chen Liushan, Pei Yu, and Furia Carlo A. 2017. Contract-based program repair without the contracts. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). IEEE, 637647.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. [12] Chen Liushan, Pei Yu, and Furia Carlo Alberto. 2021. Contract-based program repair without the contracts: An extended study. IEEE Trans. Softw. Eng. 47, 12 (2021), 2841–2857. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Chevallier Arnaud. 2016. Strategic Thinking in Complex Problem Solving. Oxford University Press.Google ScholarGoogle ScholarCross RefCross Ref
  14. [14] Cordy Maxime, Rwemalika Renaud, Papadakis Mike, and Harman Mark. 2019. Flakime: Laboratory-controlled test flakiness impact assessment. a case study on mutation testing and program repair. CoRR, abs/1912.03197 (2019). http://arxiv.org/abs/1912.03197.Google ScholarGoogle Scholar
  15. [15] Cornu Benoit, Durieux Thomas, Seinturier Lionel, and Monperrus Martin. 2015. Npefix: Automatic runtime repair of null pointer exceptions in java. CoRR, abs/1512.07423 (2015). http://arxiv.org/abs/1512.07423.Google ScholarGoogle Scholar
  16. [16] Moura Leonardo De and Bjørner Nikolaj. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337340.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Junior Heleno de S. Campos, Araújo Marco Antônio P., David José Maria N., Braga Regina, Campos Fernanda, and Ströele Victor. 2017. Test case prioritization: A systematic review and mapping of the literature. In Proceedings of the 31st Brazilian Symposium on Software Engineering. 3443.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. [18] Durieux Thomas, Cornu Benoit, Seinturier Lionel, and Monperrus Martin. 2017. Dynamic patch generation for null pointer exceptions using metaprogramming. In Proceedings of the IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER’17). IEEE, 349358.Google ScholarGoogle ScholarCross RefCross Ref
  19. [19] Durieux Thomas, Danglot Benjamin, Yu Zhongxing, Martinez Matias, Urli Simon, and Monperrus Martin. 2017. The patches of the nopol automatic repair system on the bugs of defects4j version 1.1. 0. Research Report. hal-01480084. Université Lille 1 - Sciences et Technologies.Google ScholarGoogle Scholar
  20. [20] Durieux Thomas, Madeiral Fernanda, Martinez Matias, and Abreu Rui. 2019. Empirical review of Java program repair tools: A large-scale experiment on 2,141 bugs and 23,551 repair attempts. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 302313.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. [21] Durieux Thomas and Monperrus Martin. 2016. Dynamoth: Dynamic code synthesis for automatic program repair. In Proceedings of the 11th International Workshop on Automation of Software Test. 8591.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Durieux Thomas and Monperrus Martin. 2016. IntroClassJava: A benchmark of 297 small and buggy Java programs. Research Report. hal-01272126. Universite Lille 1. https://hal.archives-ouvertes.fr/hal-01272126/file/main.pdf.Google ScholarGoogle Scholar
  23. [23] Fleischmann Marvin, Amirpur Miglena, Benlian Alexander, and Hess Thomas. 2014. Cognitive biases in information systems research: A scientometric analysis. In 22st European Conference on Information Systems, ECIS 2014, Tel Aviv, Israel, June 9-11, 2014, Michel Avital, Jan Marco Leimeister, and Ulrike Schultze (Eds.). http://aisel.aisnet.org/ecis2014/proceedings/track02/5.Google ScholarGoogle Scholar
  24. [24] Forward Andrew and Lethbridge Timothy C.. 2008. A taxonomy of software types to facilitate search and evidence-based software engineering. In Proceedings of the Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds. 179191.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. [25] Galdas Paul. 2017. Revisiting bias in qualitative research: Reflections on its relationship with funding and impact. International Journal of Qualitative Methods 16, 1 (2017), 1–2.Google ScholarGoogle Scholar
  26. [26] Gazzola Luca, Micucci Daniela, and Mariani Leonardo. 2017. Automatic software repair: A survey. IEEE Trans. Softw. Eng. 45, 1 (2017), 3467.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Ghanbari Ali, Benton Samuel, and Zhang Lingming. 2019. Practical program repair via bytecode mutation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 1930.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. [28] Goues Claire Le, Pradel Michael, and Roychoudhury Abhik. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 5665.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. [29] He, Ye, Matias, Martinez, Thomas, Durieux, Martin, and Monperrus. 2019. A comprehensive study of automatic program repair on the QuixBugs benchmark. In Proceedings of the IEEE 1st International Workshop on Intelligent Bug Fixing (IBF’19).Google ScholarGoogle Scholar
  30. [30] He Y., Martinez M., Durieux T., and Monperrus M.. 2021. A comprehensive study of automatic program repair on the QuixBugs benchmark. J. Syst. Softw. 171 (2021), 110825.Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Hua Jinru, Zhang Mengshi, Wang Kaiyuan, and Khurshid Sarfraz. 2018. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering. 1223.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. [32] Humbatova Nargiz, Jahangirova Gunel, Bavota Gabriele, Riccio Vincenzo, Stocco Andrea, and Tonella Paolo. 2020. Taxonomy of real faults in deep learning systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 11101121.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. [33] Jiang Jiajun, Ren Luyao, Xiong Yingfei, and Zhang Lingming. 2019. Inferring program transformations from singular examples via big code. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19). IEEE, 255266.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. [34] Jiang Jiajun, Xiong Yingfei, Zhang Hongyu, Gao Qing, and Chen Xiangqun. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 298309.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. [35] Jiang Nan, Lutellier Thibaud, and Tan Lin. 2021. CURE: Code-aware neural machine translation for automatic program repair. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 11611173.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. [36] Just René, Jalali Darioush, and Ernst Michael D.. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the International Symposium on Software Testing and Analysis. 437440.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. [37] Kechagia Maria, Mechtaev Sergey, Sarro Federica, and Harman Mark. 2022. Evaluating automatic program repair capabilities to repair API misuses. IEEE Trans. Softw. Eng. 48, 7 (2022), 2658–2679. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  38. [38] Keele Staffs et al. 2007. Guidelines for Performing Systematic Literature Reviews in Software Engineering. Technical Report. Citeseer.Google ScholarGoogle Scholar
  39. [39] Kim Dongsun, Nam Jaechang, Song Jaewoo, and Kim Sunghun. 2013. Automatic patch generation learned from human-written patches. In Proceedings of the 35th International Conference on Software Engineering (ICSE’13). IEEE, 802811.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Kim Jindae and Kim Sunghun. 2019. Automatic patch generation with context-based change application. Emp. Softw. Eng. 24, 6 (2019), 40714106.Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] Kitchenham Barbara, Brereton O. Pearl, Budgen David, Turner Mark, Bailey John, and Linkman Stephen. 2009. Systematic literature reviews in software engineering–a systematic literature review. Inf. Softw. Technol. 51, 1 (2009), 715.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. [42] Kitchenham Barbara, Pretorius Rialette, Budgen David, Brereton O. Pearl, Turner Mark, Niazi Mahmood, and Linkman Stephen. 2010. Systematic literature reviews in software engineering—A tertiary study. Inf. Softw. Technol. 52, 8 (2010), 792805.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Kochhar Pavneet Singh, Tian Yuan, and Lo David. 2014. Potential biases in bug localization: Do they matter? In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. 803814.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. [44] Kong Pingfan, Li Li, Gao Jun, Liu Kui, Bissyandé Tegawendé F., and Klein Jacques. 2018. Automated testing of android apps: A systematic literature review. IEEE Trans. Reliabil. 68, 1 (2018), 4566.Google ScholarGoogle ScholarCross RefCross Ref
  45. [45] Kong Xianglong, Zhang Lingming, Wong W. Eric, and Li Bixin. 2015. Experience report: How do techniques, programs, and tests impact automated program repair? In Proceedings of the IEEE 26th International Symposium on Software Reliability Engineering (ISSRE’15). IEEE, 194204.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] Koyuncu Anil, Liu Kui, Bissyandé Tegawendé F, Kim Dongsun, Klein Jacques, Monperrus Martin, and Traon Yves Le. 2020. Fixminer: Mining relevant fix patterns for automated program repair. Emp. Softw. Eng. 25, 3 (2020), 1980–2024. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Kwasnik Barbara H.. 1999. The role of classification in knowledge representation and discovery. Libr. Trends 48, 1 (1999). http://alexia.lis.uiuc.edu/puboff/catalog/trends/48_1abs.html#kwasnik.Google ScholarGoogle Scholar
  48. [48] Lawler Ryan. 2012. How do you hire great engineers? Give them a challenge. https://gigaom.com/2012/01/19/quixey-challenge/.Google ScholarGoogle Scholar
  49. [49] Le Dinh Xuan Bach, Bao Lingfeng, Lo David, Xia Xin, Li Shanping, and Pasareanu Corina. 2019. On reliability of patch correctness assessment. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 524535.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. [50] Le Xuan Bach D., Lo David, and Goues Claire Le. 2016. History driven program repair. In Proceedings of the IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER’16), Vol. 1. IEEE, 213224.Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Goues Claire Le, Dewey-Vogt Michael, Forrest Stephanie, and Weimer Westley. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Proceedings of the 34th International Conference on Software Engineering (ICSE’12). IEEE, 313.Google ScholarGoogle ScholarCross RefCross Ref
  52. [52] Goues Claire Le, Holtschulte Neal, Smith Edward K., Brun Yuriy, Devanbu Premkumar, Forrest Stephanie, and Weimer Westley. 2015. The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Trans. Softw. Eng. 41, 12 (2015), 12361256.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Goues Claire Le, Nguyen ThanhVu, Forrest Stephanie, and Weimer Westley. 2011. Genprog: A generic method for automatic software repair. IEEE Trans. Softw. Eng. 38, 1 (2011), 5472.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. [54] Li Li, Bissyandé Tegawendé F., Papadakis Mike, Rasthofer Siegfried, Bartel Alexandre, Octeau Damien, Klein Jacques, and Traon Le. 2017. Static analysis of android apps: A systematic literature review. Inf. Softw. Technol. 88 (2017), 6795.Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. [55] Li Yi, Wang Shaohua, and Nguyen Tien N.. 2020. DLFix: Context-based code transformation learning for automated program repair. In Proceedings of the ACM/IEEE 42th International Conference on Software Engineering. IEEE, 602614.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. [56] Lin Bo, Wang Shangwen, Wen Ming, and Mao Xiaoguang. 2022. Context-aware code change embedding for better patch correctness assessment. ACM Trans. Softw. Eng. Methodol. 31, 3 (2022), 129.Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. [57] Lin Bo, Wang Shangwen, Wen Ming, Zhang Zhang, Wu Hongjun, Qin Yihao, and Mao Xiaoguang. 2020. Understanding the non-repairability factors of automated program repair techniques. In Proceedings of the 27th Asia-Pacific Software Engineering Conference.Google ScholarGoogle ScholarCross RefCross Ref
  58. [58] Lin Derrick, Koppel James, Chen Angela, and Solar-Lezama Armando. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings Companion of the ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. 5556.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. [59] Liu Kui, Koyuncu Anil, Bissyandé Tegawendé F., Kim Dongsun, Klein Jacques, and Traon Yves Le. 2019. You cannot fix what you cannot find! An investigation of fault localization bias in benchmarking automated program repair systems. In Proceedings of the 12th IEEE Conference on Software Testing, Validation and Verification (ICST’19). IEEE, 102113.Google ScholarGoogle ScholarCross RefCross Ref
  60. [60] Liu Kui, Koyuncu Anil, Kim Dongsun, and Bissyandé Tegawendé F.. 2019. Avatar: Fixing semantic bugs with fix patterns of static analysis violations. In Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER’19). IEEE, 112.Google ScholarGoogle ScholarCross RefCross Ref
  61. [61] Liu Kui, Koyuncu Anil, Kim Dongsun, and Bissyandé Tegawendé F.. 2019. TBar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 3142.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. [62] Liu Kui, Koyuncu Anil, Kim Kisub, Kim Dongsun, and Bissyandé Tegawendé F. 2018. LSRepair: Live search of fix ingredients for automated program repair. In Proceedings of the 25th Asia-Pacific Software Engineering Conference (APSEC’18). IEEE, 658662.Google ScholarGoogle ScholarCross RefCross Ref
  63. [63] Liu Kui, Li Li, Koyuncu Anil, Kim Dongsun, Liu Zhe, Klein Jacques, and Bissyandé Tegawendé F.. 2021. A critical review on the evaluation of automated program repair systems. J. Syst. Softw. 171 (2021), 110817.Google ScholarGoogle ScholarCross RefCross Ref
  64. [64] Liu Kui, Wang Shangwen, Koyuncu Anil, Kim Kisub, Bissyande Tegawendé François D. Assise, Kim Dongsun, Wu Peng, Klein Jacques, Mao Xiaoguang, and Traon Yves Le. 2020. On the efficiency of test suite based program repair: A systematic assessment of 16 automated repair systems for Java programs. In Proceedings of the 42nd ACM/IEEE International Conference on Software Engineering (ICSE’20).Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. [65] Liu Xuliang and Zhong Hao. 2018. Mining stackoverflow for program repair. In Proceedings of the IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER’18). IEEE, 118129.Google ScholarGoogle ScholarCross RefCross Ref
  66. [66] Liu Yepang, Wang Jue, Wei Lili, Xu Chang, Cheung Shing-Chi, Wu Tianyong, Yan Jun, and Zhang Jian. 2019. DroidLeaks: A comprehensive database of resource leaks in Android apps. Emp. Softw. Eng. 24, 6 (2019), 34353483.Google ScholarGoogle ScholarCross RefCross Ref
  67. [67] Lorenzoni Giuliano, Alencar Paulo, Nascimento Nathalia, and Cowan Donald. 2021. Machine learning model development from a software engineering perspective: A systematic literature review. CoRR, abs/2102.07574 (2021). https://arxiv.org/abs/2102.07574.Google ScholarGoogle Scholar
  68. [68] Lou Yiling, Benton Samuel, Hao Dan, Zhang Lu, and Zhang Lingming. 2021. How does regression test selection affect program repair? An extensive study on 2 million patches. CoRR, abs/2105.07311 (2021). https://arxiv.org/abs/2105.07311.Google ScholarGoogle Scholar
  69. [69] Lutellier Thibaud, Pham Hung Viet, Pang Lawrence, Li Yitong, Wei Moshi, and Tan Lin. 2020. CoCoNuT: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 101114.Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. [70] Madeiral Fernanda, Urli Simon, Maia Marcelo, and Monperrus Martin. 2019. Bears: An extensible Java bug benchmark for automatic program repair studies. In Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER’19). IEEE, 468478.Google ScholarGoogle ScholarCross RefCross Ref
  71. [71] Majd Amirabbas, Vahidi-Asl Mojtaba, Khalilian Alireza, Baraani-Dastjerdi Ahmad, and Zamani Bahman. 2019. Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques. J. Comput. Lang. 53 (2019), 3852.Google ScholarGoogle ScholarCross RefCross Ref
  72. [72] Marginean Alexandru, Bader Johannes, Chandra Satish, Harman Mark, Jia Yue, Mao Ke, Mols Alexander, and Scott Andrew. 2019. Sapfix: Automated end-to-end repair at scale. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP’19). IEEE, 269278.Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. [73] Martinez Matias and Monperrus Martin. 2015. Mining software repair models for reasoning on the search space of automated program fixing. Emp. Softw. Eng. 20, 1 (2015), 176205.Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. [74] Martinez Matias and Monperrus Martin. 2016. Astor: A program repair library for java. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 441444.Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. [75] Martinez Matias and Monperrus Martin. 2018. Ultra-large repair search space with automatically mined templates: The cardumen mode of astor. In International Symposium on Search Based Software Engineering. Springer, 6586.Google ScholarGoogle Scholar
  76. [76] Minto Barbara. 2009. The Pyramid Principle: Logic in Writing and Thinking. Pearson Education.Google ScholarGoogle Scholar
  77. [77] Mohanani Rahul, Salman Iflaah, Turhan Burak, Rodríguez Pilar, and Ralph Paul. 2018. Cognitive biases in software engineering: A systematic mapping study. IEEE Trans. Softw. Eng. 46, 12 (2018), 13181339.Google ScholarGoogle ScholarCross RefCross Ref
  78. [78] Monperrus Martin. 2014. A critical review of “automatic patch generation learned from human-written patches”: Essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the 36th International Conference on Software Engineering. 234242.Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. [79] Monperrus Martin. 2018. Automatic software repair: A bibliography. ACM Comput. Surv. 51, 1 (2018), 124.Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. [80] Monperrus Martin. 2020. The living review on automated program repair. Technical Report. hal-01956501. HAL Archives Ouvertes. https://hal.archives-ouvertes.fr/hal-01956501v4/file/repair-living-review.pdf.Google ScholarGoogle Scholar
  81. [81] Monperrus Martin, Urli Simon, Durieux Thomas, Martinez Matias, Baudry Benoit, and Seinturier Lionel. 2019. Repairnator patches programs automatically. Ubiquity 2019(July2019), 112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. [82] Motwani Manish, Sankaranarayanan Sandhya, Just René, and Brun Yuriy. 2018. Do automated program repair techniques repair hard and important bugs? Emp. Softw. Eng. 23, 5 (2018), 29012947.Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. [83] Electrical Institute of and Engineers Electronics. 1987. IEEE Standard Taxonomy for Software Engineering Standards.Google ScholarGoogle Scholar
  84. [84] Pearson Spencer, Campos José, Just René, Fraser Gordon, Abreu Rui, Ernst Michael D., Pang Deric, and Keller Benjamin. 2017. Evaluating and improving fault localization. In Proceedings of the IEEE/ACM 39th International Conference on Software Engineering (ICSE’17). IEEE, 609620.Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. [85] Qi Yuhua, Liu Wenhong, Zhang Weixiang, and Yang Deheng. 2018. How to measure the performance of automated program repair. In Proceedings of the 5th International Conference on Information Science and Control Engineering (ICISCE’18). IEEE, 246250.Google ScholarGoogle ScholarCross RefCross Ref
  86. [86] Qi Yuhua, Mao Xiaoguang, Lei Yan, Dai Ziying, and Wang Chengsong. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering. 254265.Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. [87] Qi Yuhua, Mao Xiaoguang, Lei Yan, and Wang Chengsong. 2013. Using automated program repair for evaluating the effectiveness of fault localization techniques. In Proceedings of the International Symposium on Software Testing and Analysis. 191201.Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. [88] Qi Zichao, Long Fan, Achour Sara, and Rinard Martin. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the International Symposium on Software Testing and Analysis. 2436.Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. [89] Qin Yihao, Wang Shangwen, Liu Kui, Mao Xiaoguang, and Bissyandé Tegawendé F.. 2021. On the impact of flaky tests in automated program repair. In Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER’21). IEEE, 295306.Google ScholarGoogle ScholarCross RefCross Ref
  90. [90] Roychoudhury Abhik and Xiong Yingfei. 2019. Automated program repair: A step towards software automation. Sci. Chin. Inf. Sci. 62, 10 (2019), 200103.Google ScholarGoogle ScholarCross RefCross Ref
  91. [91] Saha Ripon K, Lyu Yingjun, Lam Wing, Yoshida Hiroaki, and Prasad Mukul R. 2018. Bugs.jar: A large-scale, diverse dataset of real-world java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories. 1013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. [92] Saha Ripon K., Yoshida Hiroaki, Prasad Mukul R., Tokumoto Susumu, Takayama Kuniharu, and Nanba Isao. 2018. Elixir: An automated repair tool for Java programs. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings. 7780.Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. [93] Saha Seemanta et al. 2019. Harnessing evolution for multi-hunk program repair. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 1324.Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. [94] Shakeel Yusra, Krüger Jacob, Nostitz-Wallwitz Ivonne Von, Saake Gunter, and Leich Thomas. 2019. Automated selection and quality assessment of primary studies: A systematic literature review. J. Data Inf. Qual. 12, 1 (2019), 126.Google ScholarGoogle Scholar
  95. [95] Silva André, Martinez Matias, Danglot Benjamin, Ginelli Davide, and Monperrus Martin. 2021. FLACOCO: Fault localization for Java based on industry-grade coverage. CoRR, abs/2111.12513 (2021). https://arxiv.org/abs/2111.12513.Google ScholarGoogle Scholar
  96. [96] Šmite Darja, Wohlin Claes, Galviņa Zane, and Prikladnicki Rafael. 2014. An empirically based terminology and taxonomy for global software engineering. Emp. Softw. Eng. 19, 1 (2014), 105153.Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. [97] Smith Edward K., Barr Earl T., Goues Claire Le, and Brun Yuriy. 2015. Is the cure worse than the disease? overfitting in automated program repair. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering. 532543.Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. [98] Smith Joanna and Noble Helen. 2014. Bias in research. Evid.-bas. Nurs. 17, 4 (2014), 100101.Google ScholarGoogle ScholarCross RefCross Ref
  99. [99] Stacy Webb and MacMillan Jean. 1995. Cognitive bias in software engineering. Commun. ACM 38, 6 (1995), 5763.Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. [100] Tan Shin Hwei, Yi Jooyong, Mechtaev Sergey, Roychoudhury Abhik, et al. 2017. Codeflaws: A programming competition benchmark for evaluating automated program repair tools. In Proceedings of the IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C’17). IEEE, 180–182.
  101. [101] Tao Yida, Kim Jindae, Kim Sunghun, and Xu Chang. 2014. Automatically generated patches as debugging aids: A human study. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 64–74.
  102. [102] Unterkalmsteiner Michael, Feldt Robert, and Gorschek Tony. 2014. A taxonomy for requirements engineering and software test alignment. ACM Trans. Softw. Eng. Methodol. 23, 2 (2014), 1–38.
  103. [103] Usman Muhammad, Britto Ricardo, Börstler Jürgen, and Mendes Emilia. 2017. Taxonomies in software engineering: A systematic mapping study and a revised taxonomy development method. Inf. Softw. Technol. 85 (2017), 43–59.
  104. [104] Vargha András and Delaney Harold D. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J. Educ. Behav. Stat. 25, 2 (2000), 101–132.
  105. [105] Wang Shangwen, Wen Ming, Lin Bo, Wu Hongjun, Qin Yihao, Zou Deqing, Mao Xiaoguang, and Jin Hai. 2020. Automated patch correctness assessment: How far are we? In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 968–980.
  106. [106] Wang Shangwen, Wen Ming, Mao Xiaoguang, and Yang Deheng. 2019. Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques. In Proceedings of the Evaluation and Assessment on Software Engineering. 260–266.
  107. [107] Weimer Westley, Nguyen ThanhVu, Goues Claire Le, and Forrest Stephanie. 2009. Automatically finding patches using genetic programming. In Proceedings of the IEEE 31st International Conference on Software Engineering. IEEE, 364–374.
  108. [108] Wen Ming, Chen Junjie, Wu Rongxin, Hao Dan, and Cheung Shing-Chi. 2018. Context-aware patch generation for better automated program repair. In Proceedings of the IEEE/ACM 40th International Conference on Software Engineering (ICSE’18). IEEE, 1–11.
  109. [109] Wheaton George R. 1968. Development of a taxonomy of human performance: A review of classificatory systems relating to tasks and performance. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.973.125&rep=rep1&type=pdf.
  110. [110] Wilcoxon Frank. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.
  111. [111] Wohlin Claes, Runeson Per, Höst Martin, Ohlsson Magnus C., Regnell Björn, and Wesslén Anders. 2012. Experimentation in Software Engineering. Springer Science & Business Media.
  112. [112] Wong W. Eric, Gao Ruizhi, Li Yihao, Abreu Rui, and Wotawa Franz. 2016. A survey on software fault localization. IEEE Trans. Softw. Eng. 42, 8 (2016), 707–740.
  113. [113] Xin Qi and Reiss Steven P. 2017. Leveraging syntax-related code for automated program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). IEEE, 660–670.
  114. [114] Xiong Yingfei, Wang Jie, Yan Runfa, Zhang Jiachen, Han Shi, Huang Gang, and Zhang Lu. 2017. Precise condition synthesis for program repair. In Proceedings of the IEEE/ACM 39th International Conference on Software Engineering (ICSE’17). IEEE, 416–426.
  115. [115] Xu Tongtong, Chen Liushan, Pei Yu, Zhang Tian, Pan Minxue, and Furia Carlo Alberto. 2022. Restore: Retrospective fault localization enhancing automated program repair. IEEE Trans. Softw. Eng. 48, 2 (2022), 309–326.
  116. [116] Xu Xuezheng, Sui Yulei, Yan Hua, and Xue Jingling. 2019. VFix: Value-flow-guided precise program repair for null pointer dereferences. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 512–523.
  117. [117] Xuan Jifeng, Martinez Matias, Demarco Favio, Clement Maxime, Marcote Sebastian Lamelas, Durieux Thomas, Berre Daniel Le, and Monperrus Martin. 2016. Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Trans. Softw. Eng. 43, 1 (2016), 34–55.
  118. [118] Yan Meng, Xia Xin, Fan Yuanrui, Hassan Ahmed E., Lo David, and Li Shanping. 2022. Just-in-time defect identification and localization: A two-phase framework. IEEE Trans. Softw. Eng. 48, 1 (2022), 82–101.
  119. [119] Yang Deheng. 2022. Artifact Page of Our Study. Retrieved from https://github.com/DehengYang/APRConfig, 2021.
  120. [120] Yang Deheng. 2022. An Extended Description of the 17 Known Biases. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/doc/RQ1.3_bias_mitigation/detailed_explanation_of_the_17_known_biases.md.
  121. [121] Yang Deheng. 2022. The Guideline on How to Extend APRConfig. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/How_to_extend.md.
  122. [122] Yang Deheng. 2022. The Results of Our Investigation on Known Bias Mitigation. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/doc/RQ1.3_bias_mitigation/results_of_investigation_on_known_bias_mitigation.md.
  123. [123] Yang Deheng. 2022. The Results of Quality Assessment. Retrieved from https://github.com/DehengYang/APRConfig/blob/master/doc/SLR_results/results_of_quality_assessment.md.
  124. [124] Yang Deheng, Lei Yan, Mao Xiaoguang, Lo David, Xie Huan, and Yan Meng. 2021. Is the ground truth really accurate? Dataset purification for automated program repair. In Proceedings of the IEEE 28th International Conference on Software Analysis, Evolution and Reengineering (SANER’21). IEEE.
  125. [125] Ye He, Martinez Matias, and Monperrus Martin. 2021. Automated patch assessment for program repair at scale. Emp. Softw. Eng. 26, 2 (2021), 1–38.
  126. [126] Ye He, Martinez Matias, and Monperrus Martin. 2021. Neural program repair with execution-based backpropagation. CoRR, abs/2105.04123 (2021). https://arxiv.org/abs/2105.04123.
  127. [127] Ye He, Martinez Matias, and Monperrus Martin. 2022. Neural program repair with execution-based backpropagation. In Proceedings of the IEEE/ACM 44th International Conference on Software Engineering (ICSE’22). IEEE, 1506–1518.
  128. [128] Yuan Yuan and Banzhaf Wolfgang. 2020. ARJA: Automated repair of Java programs via multi-objective genetic programming. IEEE Trans. Softw. Eng. 46, 10 (2020), 1040–1067.
  129. [129] Yuan Yuan and Banzhaf Wolfgang. 2020. Toward better evolutionary program repair: An integrated approach. ACM Trans. Softw. Eng. Methodol. 29, 1 (2020), 1–53.
  130. [130] Zhang Jie M. and Harman Mark. 2021. “Ignorance and prejudice” in software fairness. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 1436–1447.

      • Published in

        ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 3
        May 2023
        937 pages
        ISSN: 1049-331X
        EISSN: 1557-7392
        DOI: 10.1145/3594533
        • Editor: Mauro Pezzè

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 April 2023
        • Online AM: 4 September 2022
        • Accepted: 26 August 2022
        • Revised: 19 August 2022
        • Received: 5 December 2021