Abstract
Detecting faults in source code is an important task in software quality assurance. Building automated fault detectors with machine learning faces two major challenges: data imbalance and data shortage. To address these issues, this paper proposes a deep neural network and a training procedure that allow learning from limited annotated data. The network combines an unsupervised autoencoder with a supervised classifier; the two components share their first layers, which act as a program feature extractor. Notably, a large amount of unlabeled data from various sources can be leveraged to train the autoencoder independently before transferring it to the target domain. In addition, sharing layers and jointly training the reconstruction and classification tasks encourage the network to learn more sophisticated features. We conducted experiments on four real datasets, varying the amount of labeled data and adding more unlabeled data. The results confirm that the multi-task model outperforms its single-task counterparts and that leveraging unlabeled data is beneficial. Specifically, when the labeled data is reduced from 100% to 75%, 50%, and 25%, the performance of several deep networks drops sharply, while that of our model decreases only gradually.
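The shared-encoder, multi-task idea described in the abstract can be sketched in heavily simplified form. The paper uses convolutional layers over program representations; the sketch below substitutes plain linear maps and a toy joint objective purely for illustration, and the task weight `lam` is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for program feature vectors and fault labels.
n_labeled, n_unlabeled, d_in, d_hidden = 8, 32, 20, 5
X_lab = rng.normal(size=(n_labeled, d_in))
y_lab = rng.integers(0, 2, size=n_labeled)       # faulty / clean
X_unlab = rng.normal(size=(n_unlabeled, d_in))   # unlabeled programs

# Shared first layers: one "encoder" used by both tasks.
W_enc = rng.normal(scale=0.1, size=(d_in, d_hidden))
# Task-specific heads.
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_in))  # reconstruction head
w_clf = rng.normal(scale=0.1, size=d_hidden)          # classification head

def encode(X):
    """Shared feature extractor (stand-in for the conv layers)."""
    return np.tanh(X @ W_enc)

def reconstruction_loss(X):
    """Autoencoder branch: mean squared reconstruction error."""
    Z = encode(X)
    X_hat = Z @ W_dec
    return float(np.mean((X - X_hat) ** 2))

def classification_loss(X, y):
    """Classifier branch: binary cross-entropy on fault labels."""
    Z = encode(X)
    p = 1.0 / (1.0 + np.exp(-(Z @ w_clf)))
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# Joint objective: unlabeled programs feed only the reconstruction term,
# labeled programs feed both; lam balances the two tasks (assumed weight).
lam = 0.5
loss = (reconstruction_loss(np.vstack([X_lab, X_unlab]))
        + lam * classification_loss(X_lab, y_lab))
print(round(loss, 4))
```

Because the encoder weights appear in both loss terms, a gradient step on this joint objective would update the shared layers from both tasks at once, which is the mechanism the abstract credits for the richer learned features.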
Notes
Our implementation is publicly available at https://github.com/pvanh/ae_cnn_multitask
Acknowledgements
This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2018.306.
Cite this article
Phan, A.V., Nguyen, K.D.T. & Bui, L.T. Semi-supervised multitask learning using convolutional autoencoder for faulty code detection with limited data. Appl Intell 53, 3877–3888 (2023). https://doi.org/10.1007/s10489-022-03663-5