Semi-supervised multitask learning using convolutional autoencoder for faulty code detection with limited data

Published in: Applied Intelligence

Abstract

Detecting faults in source code so that they can be fixed is an important task in software quality assurance. Building automated detectors using machine learning faces two major challenges: data imbalance and data shortage. To address these issues, this paper proposes a deep neural network and training procedures that allow learning with limited annotated data. The network is composed of an unsupervised autoencoder and a supervised classifier. The two components share their first layers, which act as a program feature extractor. Notably, a large amount of unlabeled data from various sources can be leveraged to train the autoencoder independently before transferring it to the target domain. Additionally, sharing layers and jointly training the reconstruction and classification tasks stimulate the generation of sophisticated features. We conducted experiments on four real datasets with different amounts of labeled data and with additional unlabeled data. The results confirm that the multitask model outperforms the single-task one and that leveraging unlabeled data is beneficial. Specifically, when the labeled data are reduced from 100% to 75%, 50%, and 25%, the performance of several deep networks drops sharply, while that of our model decreases only gradually.
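The joint objective described in the abstract, a reconstruction task and a classification task trained over shared feature layers, can be sketched as a single combined loss. The following is a minimal NumPy illustration, not the authors' implementation (their code is linked in the Notes); the mean-squared-error reconstruction term, the softmax cross-entropy classification term, and the weighting factor `lam` are assumptions for illustration.

```python
import numpy as np

def joint_loss(x, x_hat, y_true, y_logits, lam=1.0):
    """Multitask objective: reconstruction MSE plus classification
    cross-entropy, combined with an assumed trade-off weight `lam`."""
    # Reconstruction term: how well the autoencoder rebuilds the input
    # from the shared feature representation.
    rec = np.mean((x - x_hat) ** 2)
    # Classification term: softmax cross-entropy on the classifier head.
    z = y_logits - y_logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    cls = -np.mean(log_probs[np.arange(len(y_true)), y_true])
    # Minimizing the sum pushes the shared layers to serve both tasks.
    return rec + lam * cls
```

Because both terms backpropagate through the shared encoder layers, gradients from the (label-free) reconstruction task keep shaping the feature extractor even when labeled examples are scarce, which is the intuition behind the robustness to reduced labeled data reported above.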


Notes

  1. http://promise.site.uottawa.ca/SERepository/

  2. Our implementation is publicly available at https://github.com/pvanh/ae_cnn_multitask

  3. https://www.codechef.com/problems/<problem-name>


Acknowledgements

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2018.306.

Author information

Correspondence to Anh Viet Phan.



Cite this article

Phan, A.V., Nguyen, K.D.T. & Bui, L.T. Semi-supervised multitask learning using convolutional autoencoder for faulty code detection with limited data. Appl Intell 53, 3877–3888 (2023). https://doi.org/10.1007/s10489-022-03663-5
