Sample-based software defect prediction with active and semi-supervised learning

Automated Software Engineering

Abstract

Software defect prediction can help us better understand and control software quality. Current defect prediction techniques mainly rely on a sufficient amount of historical project data. However, historical data is often unavailable for new projects and for many organizations, and in such cases effective defect prediction is difficult to achieve. To address this problem, we propose sample-based methods for software defect prediction. For a large software system, we can select and test a small percentage of modules, and then build a defect prediction model to predict the defect-proneness of the remaining modules. In this paper, we describe three methods for selecting a sample: random sampling with conventional machine learners, random sampling with a semi-supervised learner, and active sampling with an active semi-supervised learner. To facilitate active sampling, we propose a novel active semi-supervised learning method, ACoForest, which is able to sample the modules that are most helpful for learning a good prediction model. Our experiments on PROMISE datasets show that the proposed methods are effective and have the potential to be applied in industrial practice.
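
The sample-based workflow described above (test a small sample of modules, learn a model, predict the rest) can be illustrated with a minimal sketch. This is not ACoForest and omits the semi-supervised part entirely: it assumes a single hypothetical code metric per module, a committee of one-feature threshold classifiers trained on bootstrap resamples, and committee disagreement as the criterion for choosing which module to test next. The names `train_stump`, `committee`, `vote_fraction` and `active_sample`, and the synthetic data, are illustrative assumptions and do not come from the paper.

```python
import random

def train_stump(labeled):
    """Fit a threshold classifier on (metric, is_defective) pairs:
    predict defective when metric >= t. Returns the best threshold t."""
    candidates = sorted({x for x, _ in labeled}) + [float("inf")]
    best_t, best_err = candidates[0], len(labeled) + 1
    for t in candidates:
        err = sum((x >= t) != y for x, y in labeled)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def committee(labeled, size, rng):
    """Train `size` stumps, each on a bootstrap resample of the labeled modules."""
    return [train_stump([rng.choice(labeled) for _ in labeled])
            for _ in range(size)]

def vote_fraction(stumps, x):
    """Fraction of committee members predicting 'defective' for metric value x."""
    return sum(x >= t for t in stumps) / len(stumps)

def active_sample(metrics, oracle, n_initial, n_queries, size=15, seed=0):
    """Test n_initial random modules, then n_queries modules chosen by
    committee disagreement; predict defect-proneness of the untested rest."""
    rng = random.Random(seed)
    unlabeled = list(range(len(metrics)))
    rng.shuffle(unlabeled)
    labels = {}
    for _ in range(n_initial):
        i = unlabeled.pop()
        labels[i] = oracle(i)          # "testing" a module reveals its status
    for _ in range(n_queries):
        stumps = committee([(metrics[i], labels[i]) for i in labels], size, rng)
        # Most disputed module: vote fraction closest to 0.5.
        pick = min(unlabeled,
                   key=lambda i: abs(vote_fraction(stumps, metrics[i]) - 0.5))
        unlabeled.remove(pick)
        labels[pick] = oracle(pick)
    stumps = committee([(metrics[i], labels[i]) for i in labels], size, rng)
    predictions = {i: vote_fraction(stumps, metrics[i]) >= 0.5 for i in unlabeled}
    return labels, predictions

# Synthetic illustration only: 200 modules, one made-up metric, made-up truth.
data_rng = random.Random(42)
metrics = [data_rng.uniform(0, 100) for _ in range(200)]
truth = [m > 60 for m in metrics]
labels, predictions = active_sample(metrics, lambda i: truth[i],
                                    n_initial=10, n_queries=10)
accuracy = sum(predictions[i] == truth[i] for i in predictions) / len(predictions)
print(f"tested {len(labels)} modules, accuracy on the rest: {accuracy:.2f}")
```

Setting `n_queries=0` reduces the sketch to random sampling with a conventional learner, which gives a simple baseline for comparing the two sampling strategies on the same data.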

Author information

Corresponding author

Correspondence to Hongyu Zhang.

About this article

Cite this article

Li, M., Zhang, H., Wu, R. et al. Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19, 201–230 (2012). https://doi.org/10.1007/s10515-011-0092-1

  • DOI: https://doi.org/10.1007/s10515-011-0092-1