
A Fault Prediction Model with Limited Fault Data to Improve Test Process

Conference paper
Product-Focused Software Process Improvement (PROFES 2008)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5089)

Abstract

Software fault prediction models are used to identify fault-prone software modules and to help produce reliable software. The performance of a software fault prediction model depends on the available software metrics and fault data. In some cases, only a few software modules have fault data, and prediction models built from labeled data alone cannot provide accurate results. Semi-supervised learning approaches, which benefit from both unlabeled and labeled data, can be applied in this case. In this paper, we propose a semi-supervised learning approach based on artificial immune systems. The proposed approach uses a recent semi-supervised algorithm called YATSI (Yet Another Two Stage Idea), with AIRS (Artificial Immune Recognition System) applied in the first stage of YATSI. In addition, AIRS, the RF (Random Forests) classifier, AIRS-based YATSI, and RF-based YATSI are benchmarked. Experimental results showed that while the YATSI algorithm improved the performance of AIRS, it diminished the performance of RF on unbalanced datasets. Furthermore, the performance of AIRS-based YATSI is comparable with that of RF, which several studies have reported to be the best-performing machine learning classifier.
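To make the two-stage idea in the abstract concrete, the following is a minimal Python sketch of a YATSI-style workflow. It is not the authors' implementation: AIRS is not available in common libraries, so scikit-learn's RandomForestClassifier stands in as the first-stage learner, and the function name yatsi_predict, the weight_factor parameter, and the simple weighted nearest-neighbour vote are illustrative assumptions only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def yatsi_predict(X_labeled, y_labeled, X_unlabeled, X_test, k=10, weight_factor=1.0):
    """Two-stage semi-supervised prediction in the spirit of YATSI (sketch, not the paper's code)."""
    # Stage 1: fit the base classifier on the labeled modules and
    # "pre-label" the unlabeled modules with its predictions.
    # (RandomForestClassifier is a stand-in for AIRS here.)
    base = RandomForestClassifier(n_estimators=100, random_state=0)
    base.fit(X_labeled, y_labeled)
    y_prelabeled = base.predict(X_unlabeled)

    # Pre-labeled instances receive a reduced weight so they cannot
    # outvote the genuinely labeled data (a common YATSI-style choice).
    n_l, n_u = len(X_labeled), len(X_unlabeled)
    w_unlabeled = weight_factor * n_l / max(n_u, 1)

    X_all = np.vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([y_labeled, y_prelabeled])
    w_all = np.concatenate([np.ones(n_l), np.full(n_u, w_unlabeled)])

    # Stage 2: weighted k-nearest-neighbour vote over the combined set.
    nn = NearestNeighbors(n_neighbors=min(k, len(X_all))).fit(X_all)
    _, idx = nn.kneighbors(X_test)
    predictions = []
    for neighbors in idx:
        votes = {}
        for j in neighbors:
            votes[y_all[j]] = votes.get(y_all[j], 0.0) + w_all[j]
        predictions.append(max(votes, key=votes.get))
    return np.array(predictions)

The down-weighting of pre-labeled instances in the second stage is what allows the approach to benefit from unlabeled modules while limiting the damage when the first-stage classifier mislabels them.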




Editor information

Andreas Jedlitschka, Outi Salo


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Catal, C., Diri, B. (2008). A Fault Prediction Model with Limited Fault Data to Improve Test Process. In: Jedlitschka, A., Salo, O. (eds) Product-Focused Software Process Improvement. PROFES 2008. Lecture Notes in Computer Science, vol 5089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69566-0_21

  • DOI: https://doi.org/10.1007/978-3-540-69566-0_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69564-6

  • Online ISBN: 978-3-540-69566-0

  • eBook Packages: Computer Science (R0)
