DOI: 10.1145/3510003.3510136
research-article

A universal data augmentation approach for fault localization

Published: 05 July 2022

Abstract

Data is the fuel of models, and this holds for fault localization (FL) as well. Many existing elaborate FL techniques take the code coverage matrix and the failure vector as inputs and are expected to find the correlation between program entities and failures. However, this input data is high-dimensional and extremely imbalanced: real-world programs are large, and failing test cases are far fewer than passing ones. Both properties severely threaten the effectiveness of FL techniques.
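As a minimal illustration of the inputs described above, the following Python sketch builds a toy coverage matrix and failure vector and scores each statement with the Ochiai coefficient. The data, the function names, and the choice of Ochiai are assumptions made for illustration; the abstract does not name the six FL techniques it evaluates.

```python
import math

# Hypothetical toy data: rows are test cases, columns are statements.
# coverage[i][j] == 1 means test i executed statement j.
coverage = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]
# failures[i] == 1 means test i failed; only one of the five tests fails.
failures = [0, 0, 0, 0, 1]


def ochiai(coverage, failures):
    """Return one Ochiai suspiciousness score per statement."""
    total_failed = sum(failures)
    scores = []
    for j in range(len(coverage[0])):
        # ef / ep: failing / passing tests that executed statement j.
        ef = sum(1 for row, f in zip(coverage, failures) if row[j] and f)
        ep = sum(1 for row, f in zip(coverage, failures) if row[j] and not f)
        denom = math.sqrt(total_failed * (ef + ep))
        scores.append(ef / denom if denom else 0.0)
    return scores


def imbalance_ratio(failures):
    """Number of passing tests per failing test."""
    failed = sum(failures)
    return (len(failures) - failed) / failed if failed else float("inf")


scores = ochiai(coverage, failures)
ratio = imbalance_ratio(failures)
```

With a single failing test, every score rests on one observation, which is the kind of imbalance the abstract identifies as a threat to FL effectiveness.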
To overcome these limitations, we propose Aeneas, a universal data augmentation approach that generAtEs syNthesized failing tEst cases from reduced feAture Space for more precise fault localization. Specifically, Aeneas first applies a revised principal component analysis (PCA) to produce a reduced feature space that represents the original coverage matrix more concisely, which improves the effectiveness of data augmentation and also makes data synthesis more efficient. Aeneas then addresses the data imbalance by generating synthesized failing test cases from the reduced feature space with a conditional variational autoencoder (CVAE). To evaluate Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) with six state-of-the-art FL techniques. The experimental results show that Aeneas is statistically more effective than the baselines; for example, it improves the six original methods by 89% on average in Top-1 accuracy.
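The two stages can be sketched roughly as follows. This is a simplified stand-in, not the authors' implementation: it uses plain PCA (computed by power iteration with deflation) rather than the revised PCA, and it replaces the CVAE with Gaussian jitter around existing failing tests in the reduced space. The toy data, all function names, and the noise level are hypothetical.

```python
import random

# Hypothetical toy data: 5 tests x 4 statements; only the last test fails.
coverage = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]
failures = [0, 0, 0, 0, 1]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def pca_components(rows, k, iters=200, seed=0):
    """Return (column means, k leading principal directions) of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = normalize([rng.random() + 0.1 for _ in range(d)])
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            if dot(w, w) == 0.0:  # no variance left in any direction
                break
            v = normalize(w)
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return means, components


def project(rows, means, components):
    """Map each row to its coordinates in the reduced feature space."""
    return [[dot([x - m for x, m in zip(r, means)], c) for c in components]
            for r in rows]


def synthesize_failing(points, failures, noise=0.05, seed=0):
    """Jitter existing failing points until both classes are the same size."""
    rng = random.Random(seed)
    failing = [p for p, f in zip(points, failures) if f]
    if not failing:
        return []
    needed = (len(points) - len(failing)) - len(failing)
    synthetic = []
    for _ in range(max(0, needed)):
        base = rng.choice(failing)
        synthetic.append([c + rng.gauss(0.0, noise) for c in base])
    return synthetic


means, components = pca_components(coverage, k=2)
reduced = project(coverage, means, components)
synthetic = synthesize_failing(reduced, failures)
augmented = reduced + synthetic
augmented_failures = failures + [1] * len(synthetic)
```

Here the single failing test is complemented by three synthetic failing points, so the augmented set holds four passing and four failing rows. A CVAE would instead learn a generative model conditioned on the pass/fail label and sample new failing points from it.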

Published In

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022
2508 pages
ISBN: 9781450392211
DOI: 10.1145/3510003
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data augmentation
  2. fault localization
  3. imbalanced data

Qualifiers

  • Research-article

Funding Sources

  • the National Natural Science Foundation of China
  • the Fundamental Research Funds for the Central Universities
  • the Natural Science Foundation of Chongqing
  • the National Key Research and Development Project of China
  • the National Defense Basic Scientific Research Project

Conference

ICSE '22

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 122
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 02 Mar 2025

Cited By

  • (2025) Feature learning for bearing prognostics: A comprehensive review of machine/deep learning methods, challenges, and opportunities. Measurement 245 (116589). DOI: 10.1016/j.measurement.2024.116589. Online publication date: Mar-2025
  • (2025) CG-FL: A data augmentation approach using context-aware genetic algorithm for fault localization. Journal of Systems and Software 222 (112359). DOI: 10.1016/j.jss.2025.112359. Online publication date: Apr-2025
  • (2024) A Data Augmentation Method for Fault Localization with Fault Propagation Context and VAE. IEICE Transactions on Information and Systems E107.D:2 (234-238). DOI: 10.1587/transinf.2023EDL8052. Online publication date: 1-Feb-2024
  • (2024) Combining Coverage and Expert Features with Semantic Representation for Coincidental Correctness Detection. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1770-1782). DOI: 10.1145/3691620.3695542. Online publication date: 27-Oct-2024
  • (2024) Do not neglect what's on your hands: localizing software faults with exception trigger stream. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (982-994). DOI: 10.1145/3691620.3695479. Online publication date: 27-Oct-2024
  • (2024) Evaluating Fault Localization and Program Repair Capabilities of Existing Closed-Source General-Purpose LLMs. Proceedings of the 1st International Workshop on Large Language Models for Code (75-78). DOI: 10.1145/3643795.3648390. Online publication date: 20-Apr-2024
  • (2024) Towards More Precise Coincidental Correctness Detection with Deep Semantic Learning. IEEE Transactions on Software Engineering (1-24). DOI: 10.1109/TSE.2024.3481893. Online publication date: 2024
  • (2024) RLocator: Reinforcement Learning for Bug Localization. IEEE Transactions on Software Engineering 50:10 (2695-2708). DOI: 10.1109/TSE.2024.3452595. Online publication date: 1-Oct-2024
  • (2024) On the Stability and Applicability of Deep Learning in Fault Localization. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (546-555). DOI: 10.1109/SANER60148.2024.00062. Online publication date: 12-Mar-2024
  • (2024) FusionFL: A Statement-Level Feature Fusion Based Fault Localization Approach. 2024 IEEE Conference on Software Testing, Verification and Validation (ICST) (37-46). DOI: 10.1109/ICST60714.2024.00013. Online publication date: 27-May-2024
