A universal data augmentation approach for fault localization

Published: 05 July 2022


Data is the fuel of models, and this holds for fault localization (FL) as well. Many existing, elaborate FL techniques take the code coverage matrix and the failure vector as inputs, expecting to find the correlation between program entities and failures. However, the input data is high-dimensional and extremely imbalanced: real-world programs are large, and failing test cases are far fewer than passing test cases. Both properties pose severe threats to the effectiveness of FL techniques.
To overcome these limitations, we propose Aeneas, a universal data augmentation approach that gener<u>A</u>t<u>e</u>s sy<u>n</u>thesized failing t<u>e</u>st cases from reduced fe<u>a</u>ture <u>s</u>pace for more precise fault localization. Specifically, to improve the effectiveness of data augmentation, Aeneas first applies a revised principal component analysis (PCA) to generate a reduced feature space that represents the original coverage matrix more concisely, which also makes data synthesis more efficient. Then, Aeneas handles the imbalanced data issue by generating synthesized failing test cases from the reduced feature space with a conditional variational autoencoder (CVAE). To evaluate the effectiveness of Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) with six state-of-the-art FL techniques. The experimental results show that Aeneas is statistically more effective than the baselines, e.g., our approach improves the six original methods by 89% on average in Top-1 accuracy.
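As a rough illustration of the inputs and the first stage described above, the following sketch builds a toy coverage matrix and failure vector, reduces the feature space, and counts how many failing test cases a generator would need to synthesize. It rests on several assumptions that are not from the paper: it uses standard PCA (via NumPy's SVD) rather than the revised PCA that Aeneas applies, it assumes the target is an equal number of passing and failing tests, and the function names `pca_reduce` and `failing_tests_needed` are hypothetical. The CVAE stage is omitted.

```python
import numpy as np


def pca_reduce(coverage, n_components):
    """Project a tests-by-statements coverage matrix onto its top principal
    components. Standard PCA, used here as a stand-in for the revised PCA."""
    mean = coverage.mean(axis=0)
    centered = coverage - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean


def failing_tests_needed(failures):
    """Number of synthesized failing tests required to match the number of
    passing tests (the balancing target is an assumption of this sketch)."""
    n_fail = int(failures.sum())
    n_pass = len(failures) - n_fail
    return max(n_pass - n_fail, 0)


# Toy data: 6 test cases (rows) by 5 statements (columns);
# an entry of 1 means the test executed that statement.
coverage = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
], dtype=float)

# Failure vector: 1 marks a failing test. Only the last test fails.
failures = np.array([0, 0, 0, 0, 0, 1])

reduced, components, mean = pca_reduce(coverage, n_components=2)
print(reduced.shape)                   # (6, 2)
print(failing_tests_needed(failures))  # 4
```

In a full pipeline, a generative model conditioned on the failing label would produce the missing samples in the reduced space, and the augmented data would then be passed to an FL technique.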


ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022
2508 pages


Author Tags

  1. data augmentation
  2. fault localization
  3. imbalanced data


