The impacts of techniques, programs and tests on automated program repair: An empirical study

https://doi.org/10.1016/j.jss.2017.06.039

Highlights

  • An extensive study on the impacts of several factors on program repair is proposed.

  • The performance of the techniques declines as program size increases.

  • Adding more passed tests does not impact real repair effectiveness.

  • Adding more failed tests helps the deterministic techniques to some extent.

  • Four techniques find more than 80% of patches within the first 50% of search space.

Abstract

Manual program repair is notoriously tedious, error-prone, and costly, especially for modern large-scale projects. Automated program repair can find program patches without much human intervention, greatly reducing the burden on developers and accelerating software delivery. Therefore, much research effort has been dedicated to designing powerful program repair techniques. To date, although various program repair techniques have been proposed, to our knowledge there has been no extensive study of the impacts of repair techniques, subject programs, and test suites on repair effectiveness and efficiency. In this paper, we perform such an extensive study by repairing 180 seeded and real faults from 17 small- to large-sized programs. We study the impacts of five representative automated program repair techniques, namely GenProg, RSRepair, the brute-force-based technique, AE, and Kali, on the repair results. We further investigate the impacts of different subject programs and test suites on the effectiveness and efficiency of program repair techniques. Our study demonstrates a number of interesting findings: the brute-force-based technique generates the most patches but is also the most costly technique, while Kali is the most efficient and has medium effectiveness among the studied techniques; techniques that work well on small programs become too costly or ineffective when applied to large programs; since tool-reported patches may overfit the selected test cases, we calculate the false positive rates and find that the influence of failed test cases is much larger than that of passed test cases; finally, and surprisingly, all the studied techniques except RSRepair find more than 80% of successful patches within the first 50% of the search space.

Introduction

Software bugs are prevalent in software development practice. As the current mainstream methodology for fixing software bugs, manual program repair requires developers to manually reason about the potential buggy locations and then choose corresponding repair strategies based on various sources of information (e.g., bug reports, test execution results, and stack traces). Due to the various challenges in localizing buggy locations and designing fixing strategies, manual program repair has been recognized as one of the most tedious and time-consuming tasks in the software industry. Furthermore, manual program repair itself can be error-prone and may introduce new bugs. For example, Gu et al. (2010) performed an empirical study on the bug databases of three projects and found that 9% of the bugs were actually caused by buggy manual bug fixes.

Due to the limitations of manual program repair, a large body of research effort has been dedicated to automated program repair, which aims to automatically find program patches without much human intervention. Ideally, automated program repair can greatly reduce the burden on software developers during debugging, and can even help speed up software delivery. Therefore, in the recent decade, more and more automated program repair techniques have been proposed (Arcuri and Yao, 2008; Dallmeier et al., 2009; Samirni et al., 2012; Kim et al., 2013; Qi et al., 2014; Wei et al., 2010; Qi et al., 2015). GenProg (Weimer et al., 2009) pioneered the generate-and-validate approach to automated program repair. To repair a faulty program, generate-and-validate techniques first rank the potentially suspicious statements that may contain bugs using off-the-shelf fault localization techniques; then they generate candidate patches by manipulating suspicious statements in the ranked list (e.g., via statement removal, addition, and replacement); finally, they validate each generated patch by running the patched version against the original test suite: if all test cases pass, the patch is considered validated and reported to the user. Since the generate-and-validate approach is easy to apply, fully automated, and scalable to large programs, a number of techniques in this line of work have been proposed after GenProg, such as AE (Weimer et al., 2013), RSRepair (Qi et al., 2014), and Kali (Qi et al., 2015), among others (Bradbury and Jalbert, 2010; Schulte et al., 2010; Arcuri, 2011).

Among all the generate-and-validate program repair techniques, five are representative of general program repair (i.e., neither designed for specific types of bugs nor dependent on historical patch patterns): (1) GenProg, (2) RSRepair, (3) the brute-force-based technique, (4) AE, and (5) Kali. RSRepair and Kali were proposed by Qi et al. (2014) and Qi et al. (2015), respectively, while Weimer et al. and Le Goues et al. proposed the other three techniques in GenProg and its extensions (Le Goues et al., 2012a; Le Goues et al., 2012b; Weimer et al., 2009; Weimer et al., 2013). These techniques can be categorized into deterministic techniques (i.e., the brute-force-based technique, AE, and Kali) and non-deterministic techniques (i.e., GenProg and RSRepair) based on whether they involve randomization during the repair process. Four of these techniques have been reported to produce promising results: GenProg is reported to fix 55 out of 105 real bugs; RSRepair is reported to fix 24 out of 24 bugs (a subset of the 55 bugs reported to be fixed by GenProg); AE is reported to fix 53 out of the 105 bugs; and Kali is reported to fix at least as many bugs as the three techniques above. In contrast, the brute-force-based technique, proposed in the second version of Le Goues et al. (2012a), has not previously been fully evaluated due to its high cost.

Although various general program repair techniques have been proposed, to our knowledge there has been no extensive study of the impacts of repair techniques, subject programs, and test suites on automated program repair. Our previous study (Kong et al., 2015) provided a good starting point by empirically comparing four program repair techniques (i.e., the brute-force-based technique, AE, GenProg, and RSRepair) on different sets of tests from nine small- to medium-sized subjects. Building on that work, we further investigate the impacts of these factors on more medium and large subjects with real faults, and we also include another general program repair technique, Kali, in our experiments. Since different techniques may perform differently on subject programs of different scales, our study evaluates all five techniques on 180 buggy versions of 17 small- to large-sized C programs, ranging from 135 to 2159K lines of code. Furthermore, in generate-and-validate program repair, test suites are essential to repair effectiveness and efficiency: the fault localization step is largely influenced by the test suite (e.g., different numbers of passed/failed test cases may produce different fault localization results, which in turn influence both the efficiency and effectiveness of patch generation), and the patch validation step is influenced as well, since a larger test suite may produce more precise validation results while incurring more validation time. Therefore, we further investigate how the five techniques perform in terms of both effectiveness and efficiency by controlling the number of passed/failed test cases in the test suite.

In summary, the paper makes the following contributions:

  • An extensive empirical study that considers the impacts of different repair techniques, subject programs, and test suites on general automated program repair.

  • An extensive empirical study that analyzes the preferences and limitations of automated program repair techniques.

  • An extensive empirical study that quantitatively computes the false positive rates, as well as real patch rates of automated program repair techniques.

  • The experimental results on 180 bugs from 17 subject programs provide various interesting findings and practical guidelines regarding the efficiency and effectiveness of automated program repair in different settings, e.g., different repair techniques, programs, and test suites. For example, the results demonstrate that the brute-force-based technique is the most effective but also the most costly, while Kali is the most efficient with medium effectiveness; techniques that work well on small programs become too costly or ineffective when applied to large programs; for deterministic techniques, increasing the number of passed tests does not impact the real success rates, while increasing the number of failed tests can improve the real success rates to some extent; and all the studied techniques except RSRepair find more than 80% of successful patches within the first 50% of the search space.

Section snippets

Background

Generate-and-validate techniques consist of three steps: fault localization, patch generation, and patch validation. The fault localization step first analyzes the execution paths of both passed and failed test cases to generate a ranked list of suspicious statements that may be faulty. Then, the patch generation step generates candidate patches by manipulating suspicious statements in the ranked list. Manipulation rules used in these techniques include statement addition, replacement and removal as well
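The three-step pipeline above can be sketched as follows. This is a minimal illustration, not the actual logic of any of the studied tools: the suspiciousness score is a simple spectrum-based stand-in, programs are modeled as lists of statements, and the mutation operators reuse existing statements as donor code in the spirit of GenProg:

```python
def localize(program, passed, failed):
    """Rank statements by a simple suspiciousness score: statements
    covered by more failing and fewer passing tests rank higher."""
    scores = {}
    for stmt in range(len(program)):
        fail_cov = sum(1 for t in failed if stmt in t["coverage"])
        pass_cov = sum(1 for t in passed if stmt in t["coverage"])
        scores[stmt] = fail_cov / (fail_cov + pass_cov + 1e-9)
    return sorted(scores, key=scores.get, reverse=True)

def generate_patches(program, ranked):
    """Yield candidate patches via statement removal, replacement,
    and addition, reusing statements from the program as donor code."""
    for stmt in ranked:
        yield ("remove", stmt, None)
        for donor in range(len(program)):
            yield ("replace", stmt, donor)
            yield ("insert", stmt, donor)

def apply_patch(program, patch):
    """Produce a patched copy of the program."""
    op, stmt, donor = patch
    patched = list(program)
    if op == "remove":
        patched[stmt] = "pass"
    elif op == "replace":
        patched[stmt] = program[donor]
    else:
        patched.insert(stmt, program[donor])
    return patched

def repair(program, passed, failed, run_test):
    """Validate each candidate against the whole test suite; report
    the first patch that makes every test pass."""
    ranked = localize(program, passed, failed)
    for patch in generate_patches(program, ranked):
        patched = apply_patch(program, patch)
        if all(run_test(patched, t) for t in passed + failed):
            return patch
    return None  # search space exhausted without a validated patch
```

The search order matters: because candidates are tried in suspiciousness order, a good fault localization ranking lets the loop find a validated patch early in the search space, which is exactly what RQ5 later probes.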

Study approach

In this work, we aim to answer the following research questions:

  • RQ1: What is the overall performance of the different studied techniques?

  • RQ2: How do the five techniques perform with different sizes of subject systems?

  • RQ3: How do the studied techniques perform on different test suites?

  • RQ4: Do the tool-reported patches really fix the bugs?

  • RQ5: Is there a boundary point in the search space such that, if a technique cannot find a successful patch before it, it also cannot succeed after it in most cases?

RQ1: overall performance

The overall repair results for the five studied techniques on all programs are shown in Table 2. In the table, Column 1 lists the subject systems; Columns 2 to 6 present the overall repair success rates for the five techniques, i.e., the ratio of tool-reported patches to all repair attempts across all faults and test suite constructions. From the table, we make the following observations.
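As a concrete illustration of the metrics used here (with made-up numbers, not the paper's data), the success rate from Table 2 and the false positive rate examined later can be computed as:

```python
def success_rate(reported):
    """Ratio of repair attempts that produced a tool-reported patch
    to all repair attempts (the definition used for Table 2).
    `reported` holds one boolean per fault/test-suite combination."""
    return sum(reported) / len(reported)

def false_positive_rate(patches_ok):
    """Share of tool-reported patches that fail held-out validation,
    i.e., patches that overfit the tests used during repair."""
    return 1 - sum(patches_ok) / len(patches_ok)

# Hypothetical data for one technique: 5 attempts yield 3 reported
# patches, of which only 1 survives held-out validation.
reported = [True, True, False, True, False]
print(f"success rate: {success_rate(reported):.0%}")                    # → 60%
print(f"false positive rate: {false_positive_rate([True, False, False]):.0%}")  # → 67%
```

Keeping the two rates separate matters for interpreting the tables: a technique can report many patches (high success rate) while most of them are false positives, which is why the study also computes real patch rates.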

First, we can divide the five studied techniques into three categories according to the overall

Automated program repair techniques

Current automated program repair techniques can be roughly divided into two categories according to how they produce candidate patches: generate-and-validate techniques and correct-by-construction techniques (Le Goues et al., 2015). Generate-and-validate techniques first rank the potentially suspicious statements and generate candidate patches by manipulating suspicious statements according to specific rules, then validate each candidate patch for correctness. Correct-by-construction

Conclusion

To investigate how different techniques, subject programs, and test suites impact the effectiveness and efficiency of automated program repair, we conduct this empirical study. We applied each of the five studied repair techniques to different test suites and faulty programs more than 4000 times. The results show that the success rates of the brute-force-based technique and AE are higher than those of GenProg and RSRepair in most cases, while Kali usually costs the least time but has the

Acknowledgments

We would like to thank W. Weimer, C. Le Goues, and Y. Qi et al. for sharing the source code of GenProg and RSRepair with us. This work is sponsored partially by the China Scholarship Council under No. 201406090080, partially by the National Natural Science Foundation of China under Grant Nos. 61572126 and 61402103, and partially by the Huawei Innovation Research Program (HIRP) under Grant No. YB2013120195.

Xianglong Kong is a joint Ph.D. student of Southeast University and The University of Texas at Dallas. He got his Bachelor’s degree in Computer Science from Southeast University (China) in 2009. He has studied under the supervision of Prof. W. Eric Wong and Lingming Zhang in Department of Computer Science, The University of Texas at Dallas from 2014 to 2016. He is studying under the supervision of Prof. Bixin Li in Software Engineering Institute, Southeast University.

References (52)

  • A. Arcuri. Evolutionary repair of faulty software. Appl. Soft Comput. J. (2011)
  • V. Debroy et al. Combining mutation and fault localization for automated program debugging. J. Syst. Softw. (2014)
  • W.B. Langdon et al. Efficient multi-objective higher order mutation testing with genetic programming. J. Syst. Softw. (2010)
  • A. Arcuri et al. A novel co-evolutionary approach to automatic software bug fixing. Proceedings of the 2008 IEEE Congress on Evolutionary Computation (2008)
  • E.T. Barr et al. The plastic surgery hypothesis. Proceedings of the 22nd ACM SIGSOFT International Symposium on the Foundations of Software Engineering (2014)
  • J.S. Bradbury et al. Automatic repair of concurrency bugs. Proceedings of the 2nd International Symposium on Search Based Software Engineering (2010)
  • M. Carbin et al. Detecting and escaping infinite loops with Jolt. Proceedings of the 25th European Conference on Object-Oriented Programming (2011)
  • V. Dallmeier et al. Generating fixes from object behavior anomalies. Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (2009)
  • V. Debroy et al. Using mutation to automatically suggest fixes for faulty programs. Proceedings of the 3rd International Conference on Software Testing, Verification and Validation (2010)
  • F. DeMarco et al. Automatic repair of buggy if conditions and missing preconditions with SMT. Proceedings of the 6th International Workshop on Constraints in Software Testing, Verification, and Analysis (2014)
  • H. Do et al. Supporting controlled experimentation with testing techniques: an infrastructure and its potential impact. Empirical Softw. Eng. Int. J. (2005)
  • Durieux, T., Martinez, M., Monperrus, M., Sommerard, R., Xuan, J., 2015. Automatic repair of real bugs: an experience...
  • B. Elkarablieh et al. Juzi: a tool for repairing complex data structures. Proceedings of the 30th International Conference on Software Engineering (2008)
  • Q. Gao et al. Safe memory-leak fixing for C programs. Proceedings of the 37th International Conference on Software Engineering (2015)
  • Q. Gao et al. Fixing recurring crash bugs via analyzing Q&A sites. Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (2015)
  • L.S. Ghandehari et al. Fault localization based on failure-inducing combinations. Proceedings of the 24th International Symposium on Software Reliability Engineering (2013)
  • M. Gligoric et al. Comparing non-adequate test suites using coverage criteria. Proceedings of the 2013 International Symposium on Software Testing and Analysis (2013)
  • Z. Gu et al. Has the bug really been fixed? Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (2010)
  • H. He et al. Automated debugging using path-based weakest preconditions. Fundam. Approaches Softw. Eng. (2004)
  • G. Jin et al. Automated atomicity-violation fixing. Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (2011)
  • S. Kaleeswaran et al. MintHint: automated synthesis of repair hints. Proceedings of the 36th International Conference on Software Engineering (2014)
  • D. Kim et al. Automatic patch generation learned from human-written patches. Proceedings of the 35th International Conference on Software Engineering (2013)
  • X. Kong et al. Experience report: how do techniques, programs, and tests impact automated program repair? Proceedings of the 26th International Symposium on Software Reliability Engineering (2015)
  • J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection (1992)
  • C. Le Goues et al. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. Proceedings of the 34th International Conference on Software Engineering (2012)
  • C. Le Goues et al. The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Trans. Softw. Eng. (2015)
    Lingming Zhang is an assistant professor of University of Texas at Dallas. He got his Ph.D.’s degree in May 2014 from the Electrical & Computer Engineering Department at The University of Texas at Austin under the supervision of Prof. Sarfraz Khurshid. He received his Master’s degree in Computer Science from Peking University (China) in 2010 under the supervision of Prof. Lu Zhang. Before that, he got his Bachelor’s degree in Computer Science from Nanjing University (China) in 2007.

    W. Eric Wong received his M.S. and Ph.D. in Computer Science from Purdue University. He is a full professor and the founding director of the Advanced Research Center for Software Testing and Quality Assurance in Computer Science, University of Texas at Dallas (UTD). He also has an appointment as a guest researcher with National Institute of Standards and Technology (NIST), an agency of the US Department of Commerce. Prior to joining UTD, he was with Telcordia Technologies (formerly Bellcore) as a senior research scientist and the project manager in charge of Dependable Telecom Software Development. In 2014, he was named the IEEE Reliability Society Engineer of the Year. His research focuses on helping practitioners improve the quality of software while reducing the cost of production. In particular, he is working on software testing, debugging, risk analysis/metrics, safety, and reliability. He has very strong experience developing real-life industry applications of his research results. Professor Wong is the Editor-in-Chief of IEEE Transactions on Reliability. He is also the Founding Steering Committee Chair of the IEEE International Conference on Software Quality, Reliability, and Security (QRS) and the IEEE International Workshop on Program Debugging (IWPD).

    Bixin Li is a professor of Computer Science and Engineering School at the Southeast University, Nanjing, China. His research interests include: Program slicing and its application; Software evolution and maintenance; and Software modeling, analysis, testing and verification. He has published over 90 articles in refereed conferences and journals. He leads a Software Engineering Institute in Southeast University, and over 20 young men and women are hard working on national and international projects.
