Abstract
Detecting faults in source code is an important task in software quality assurance. Building automated fault detectors with machine learning faces two major challenges: data imbalance and data shortage. To address these issues, this paper proposes a deep neural network and a training procedure that allow learning from limited annotated data. The network combines an unsupervised autoencoder with a supervised classifier; the two components share their first layers, which act as a program feature extractor. Notably, a large amount of unlabeled data from various sources can be leveraged to train the autoencoder independently before transferring it to the target domain. In addition, sharing layers and jointly training the reconstruction and classification tasks encourage the network to learn more sophisticated features. We conducted experiments on four real datasets, varying the amount of labeled data and adding more unlabeled data. The results confirm that the multi-task model outperforms its single-task counterparts and that leveraging unlabeled data is beneficial. Specifically, when the labeled data is reduced from 100% to 75%, 50%, and 25%, the performance of several deep networks drops sharply, while that of our model decreases only gradually.
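The shared-encoder, multi-task idea described in the abstract can be sketched in heavily simplified form. The paper uses convolutional layers over program representations; the sketch below substitutes plain linear maps and a toy joint objective purely for illustration, and the task weight `lam` is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for program feature vectors and fault labels.
n_labeled, n_unlabeled, d_in, d_hidden = 8, 32, 20, 5
X_lab = rng.normal(size=(n_labeled, d_in))
y_lab = rng.integers(0, 2, size=n_labeled)       # faulty / clean
X_unlab = rng.normal(size=(n_unlabeled, d_in))   # unlabeled programs

# Shared first layers: one "encoder" used by both tasks.
W_enc = rng.normal(scale=0.1, size=(d_in, d_hidden))
# Task-specific heads.
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_in))  # reconstruction head
w_clf = rng.normal(scale=0.1, size=d_hidden)          # classification head

def encode(X):
    """Shared feature extractor (stand-in for the conv layers)."""
    return np.tanh(X @ W_enc)

def reconstruction_loss(X):
    """Autoencoder branch: mean squared reconstruction error."""
    Z = encode(X)
    X_hat = Z @ W_dec
    return float(np.mean((X - X_hat) ** 2))

def classification_loss(X, y):
    """Classifier branch: binary cross-entropy on fault labels."""
    Z = encode(X)
    p = 1.0 / (1.0 + np.exp(-(Z @ w_clf)))
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# Joint objective: unlabeled programs feed only the reconstruction term,
# labeled programs feed both; lam balances the two tasks (assumed weight).
lam = 0.5
loss = (reconstruction_loss(np.vstack([X_lab, X_unlab]))
        + lam * classification_loss(X_lab, y_lab))
print(round(loss, 4))
```

Because the encoder weights appear in both loss terms, a gradient step on this joint objective would update the shared layers from both tasks at once, which is the mechanism the abstract credits for the richer learned features.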
Notes
Our implementation is publicly available at https://github.com/pvanh/ae_cnn_multitask
Acknowledgements
This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2018.306.
Cite this article
Phan, A.V., Nguyen, K.D.T. & Bui, L.T. Semi-supervised multitask learning using convolutional autoencoder for faulty code detection with limited data. Appl Intell 53, 3877–3888 (2023). https://doi.org/10.1007/s10489-022-03663-5