A Fault Prediction Model with Limited Fault Data to Improve Test Process

Catal, Cagatay; Diri, Banu

doi:10.1007/978-3-540-69566-0_21

Cagatay Catal¹ &
Banu Diri²

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5089)

Included in the following conference series:

International Conference on Product Focused Software Process Improvement

Abstract

Software fault prediction models are used to identify fault-prone software modules and to produce reliable software. The performance of a software fault prediction model is correlated with the available software metrics and fault data. In some cases, only a few software modules have fault data, so prediction models that use only labeled data cannot provide accurate results. Semi-supervised learning approaches, which benefit from both unlabeled and labeled data, may be applied in this case. In this paper, we propose an artificial immune system based semi-supervised learning approach. The proposed approach uses a recent semi-supervised algorithm called YATSI (Yet Another Two Stage Idea), and AIRS (Artificial Immune Recognition System) is applied in the first stage of YATSI. In addition, AIRS, the RF (Random Forests) classifier, AIRS-based YATSI, and RF-based YATSI are benchmarked. Experimental results showed that while the YATSI algorithm improved the performance of AIRS, it diminished the performance of RF on unbalanced datasets. Furthermore, the performance of AIRS-based YATSI is comparable with that of RF, which is the best machine learning classifier according to some studies.
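The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the commonly described YATSI scheme: a first-stage classifier pre-labels the unlabeled modules, and a weighted nearest-neighbor vote over labeled plus pre-labeled modules makes the final prediction, with each pre-labeled module weighted by f * N / M (N labeled, M unlabeled). A simple nearest-centroid classifier stands in for AIRS or RF, and every name here (train_centroids, predict_centroid, yatsi_predict, the parameters k and f) is hypothetical.

```python
from collections import defaultdict
from math import dist


def train_centroids(X, y):
    """Stage-1 stand-in: nearest-centroid classifier (NOT AIRS or RF)."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] += 1
    # Mean feature vector per class.
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_centroid(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: dist(centroids[c], x))


def yatsi_predict(X_lab, y_lab, X_unlab, queries, k=3, f=1.0):
    """Two-stage semi-supervised prediction in the spirit of YATSI."""
    # Stage 1: fit on labeled modules, then pre-label the unlabeled ones.
    model = train_centroids(X_lab, y_lab)
    pre_labels = [predict_centroid(model, x) for x in X_unlab]

    # Labeled modules keep weight 1.0; pre-labeled modules get f * N / M
    # (assumed weighting scheme, see the lead-in).
    w_unlab = f * len(X_lab) / len(X_unlab) if X_unlab else 0.0
    pool = [(x, y, 1.0) for x, y in zip(X_lab, y_lab)]
    pool += [(x, y, w_unlab) for x, y in zip(X_unlab, pre_labels)]

    # Stage 2: weighted k-nearest-neighbor vote over the combined pool.
    predictions = []
    for q in queries:
        nearest = sorted(pool, key=lambda item: dist(item[0], q))[:k]
        votes = defaultdict(float)
        for _, label, weight in nearest:
            votes[label] += weight
        predictions.append(max(votes, key=votes.get))
    return predictions


# Toy data: (lines of code, cyclomatic complexity); 1 = fault-prone.
X_lab = [(10, 1), (12, 2), (200, 15), (220, 18)]
y_lab = [0, 0, 1, 1]
X_unlab = [(15, 2), (11, 1), (210, 16), (190, 14), (205, 17)]
print(yatsi_predict(X_lab, y_lab, X_unlab, [(14, 2), (195, 15)]))
```

Lowering f reduces the influence of pre-labeled modules, which may matter when the first-stage classifier is unreliable on the minority (faulty) class.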

Author information

Authors and Affiliations

The Scientific and Technological Research Council of Turkey, Marmara Research Center, Information Technologies Institute, Kocaeli, Turkey
Cagatay Catal
Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
Banu Diri

Editor information

Andreas Jedlitschka, Outi Salo

Cite this paper

Catal, C., Diri, B. (2008). A Fault Prediction Model with Limited Fault Data to Improve Test Process. In: Jedlitschka, A., Salo, O. (eds) Product-Focused Software Process Improvement. PROFES 2008. Lecture Notes in Computer Science, vol 5089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69566-0_21

DOI: https://doi.org/10.1007/978-3-540-69566-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69564-6
Online ISBN: 978-3-540-69566-0
eBook Packages: Computer Science, Computer Science (R0)
