Abstract
Software fault prediction models are used to identify fault-prone software modules and help produce reliable software. The performance of a software fault prediction model is correlated with the available software metrics and fault data. In some cases, fault data are available for only a few software modules, and prediction models that use only labeled data therefore cannot provide accurate results. Semi-supervised learning approaches, which benefit from both unlabeled and labeled data, may be applied in this case. In this paper, we propose an artificial immune system based semi-supervised learning approach. The proposed approach uses a recent semi-supervised algorithm called YATSI (Yet Another Two Stage Idea), and AIRS (Artificial Immune Recognition System) is applied in the first stage of YATSI. In addition, AIRS, the RF (Random Forests) classifier, AIRS based YATSI, and RF based YATSI are benchmarked. Experimental results showed that while the YATSI algorithm improved the performance of AIRS, it diminished the performance of RF on unbalanced datasets. Furthermore, the performance of AIRS based YATSI is comparable with that of RF, which is the best machine learning classifier according to some studies.
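The two-stage idea behind YATSI can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid learner stands in for AIRS in the first stage, and the function names, the toy data, and the down-weighting rule for pre-labeled modules (factor `f` times the ratio of labeled to unlabeled instances) are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def centroid_fit(X, y):
    """Stand-in stage-one learner (NOT AIRS): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] += 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def centroid_predict(model, row):
    """Return the class whose centroid is closest to the row."""
    return min(model, key=lambda c: math.dist(row, model[c]))


def yatsi_predict(X_lab, y_lab, X_unlab, query, k=3, f=1.0):
    """Two-stage prediction in the spirit of YATSI (illustrative only)."""
    # Stage 1: train on the labeled modules, then pre-label the unlabeled ones.
    model = centroid_fit(X_lab, y_lab)
    pre_labels = [centroid_predict(model, row) for row in X_unlab]

    # Assumed weighting: labeled modules count fully; pre-labeled modules
    # are scaled by f * (number labeled / number unlabeled).
    w_unlab = f * len(X_lab) / len(X_unlab) if X_unlab else 0.0
    pool = [(row, label, 1.0) for row, label in zip(X_lab, y_lab)]
    pool += [(row, label, w_unlab) for row, label in zip(X_unlab, pre_labels)]

    # Stage 2: weighted k-nearest-neighbor vote over the combined pool.
    nearest = sorted(pool, key=lambda item: math.dist(query, item[0]))[:k]
    votes = defaultdict(float)
    for _, label, weight in nearest:
        votes[label] += weight
    return max(votes, key=votes.get)


# Hypothetical two-metric modules: two labeled, four unlabeled.
X_lab = [[0.0, 0.0], [10.0, 10.0]]
y_lab = ["clean", "faulty"]
X_unlab = [[1.0, 1.0], [9.0, 9.0], [0.0, 1.0], [10.0, 9.0]]
prediction = yatsi_predict(X_lab, y_lab, X_unlab, [8.0, 8.0])
```

Swapping `centroid_fit` and `centroid_predict` for another classifier changes only the first stage, which is the sense in which the abstract speaks of "AIRS based" and "RF based" YATSI.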
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Catal, C., Diri, B. (2008). A Fault Prediction Model with Limited Fault Data to Improve Test Process. In: Jedlitschka, A., Salo, O. (eds) Product-Focused Software Process Improvement. PROFES 2008. Lecture Notes in Computer Science, vol 5089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69566-0_21
DOI: https://doi.org/10.1007/978-3-540-69566-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69564-6
Online ISBN: 978-3-540-69566-0
eBook Packages: Computer Science (R0)