Abstract
Software defect prediction can help us better understand and control software quality. Current defect prediction techniques mainly rely on a sufficient amount of historical project data. However, historical data is often not available for new projects and for many organizations. In this case, effective defect prediction is difficult to achieve. To address this problem, we propose sample-based methods for software defect prediction. For a large software system, we can select and test a small percentage of modules, and then build a defect prediction model to predict the defect-proneness of the remaining modules. In this paper, we describe three methods for selecting a sample: random sampling with conventional machine learners, random sampling with a semi-supervised learner, and active sampling with an active semi-supervised learner. To facilitate active sampling, we propose a novel active semi-supervised learning method, ACoForest, which is able to sample the modules that are most helpful for learning a good prediction model. Our experiments on PROMISE datasets show that the proposed methods are effective and have the potential to be applied in industrial practice.
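The sample-based workflow described in the abstract can be illustrated with a minimal sketch. It covers only the simplest of the three settings (random sampling with a conventional learner) and uses a nearest-centroid classifier as a stand-in learner; it is not ACoForest and not the authors' implementation. The function names, the `inspect` callback that stands in for actually testing a module, and the metric vectors are all hypothetical.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of metric vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two metric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_and_predict(modules, inspect, fraction=0.1, seed=0):
    """Test a random sample of modules, then predict the remaining ones.

    modules  : dict mapping module name -> list of numeric code metrics
    inspect  : callable(name) -> truthy if the module is found defective
               (hypothetical stand-in for testing the sampled module)
    fraction : share of modules to sample and test
    Returns (labels for sampled modules, predictions for the rest).
    """
    rng = random.Random(seed)
    names = sorted(modules)
    k = min(len(names), max(1, round(fraction * len(names))))
    sampled = rng.sample(names, k)
    labels = {name: bool(inspect(name)) for name in sampled}

    # Build one centroid per class that actually appears in the sample.
    centroids = {}
    for cls in (True, False):
        rows = [modules[n] for n in sampled if labels[n] == cls]
        if rows:
            centroids[cls] = centroid(rows)

    # Assign every unsampled module to the class of its nearest centroid.
    # If the sample contains only one class, that class is predicted throughout.
    predictions = {}
    for name in names:
        if name not in labels:
            predictions[name] = min(
                centroids, key=lambda c: sq_dist(modules[name], centroids[c])
            )
    return labels, predictions
```

Only the sampled modules are ever passed to `inspect`, which mirrors the idea that testing effort is spent on a small percentage of the system. A semi-supervised or active variant would additionally use the unlabeled modules during training or choose which modules to sample next, which this sketch does not attempt.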
Cite this article
Li, M., Zhang, H., Wu, R. et al. Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19, 201–230 (2012). https://doi.org/10.1007/s10515-011-0092-1
DOI: https://doi.org/10.1007/s10515-011-0092-1