Fault-prone module detection using large-scale text features based on spam filtering

Empirical Software Engineering

Abstract

This paper proposes an approach to fault-prone module detection that uses large-scale text features and is inspired by spam filtering. The occurrences of each text feature in the source code of a module are counted and used as training data for detection models. We prepared two detection models: a naive Bayes classifier and a logistic regression model. To evaluate the approach, we conducted experiments on five open source projects and compared the results with those of a well-known metrics set; the text-feature models achieved higher detection results. The results imply that large-scale text features are useful in constructing practical detection models, and that measuring sophisticated metrics is not always necessary for detecting fault-prone modules.
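
The idea of counting text features and feeding the counts to a classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer `text_features`, the class `NaiveBayes`, the choice of a multinomial model with add-one smoothing, and the toy training modules are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers/keywords, integer literals, single symbols.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def text_features(source):
    """Count every token occurring in the source text of one module."""
    return Counter(TOKEN.findall(source))


class NaiveBayes:
    """Multinomial naive Bayes over token counts (assumed model variant)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha (Laplace) smoothing constant

    def fit(self, modules, labels):
        self.docs = Counter(labels)  # number of modules per class
        self.counts = {}             # class -> aggregated token counts
        for source, label in zip(modules, labels):
            self.counts.setdefault(label, Counter()).update(text_features(source))
        self.vocab = set().union(*self.counts.values())
        return self

    def log_scores(self, source):
        """Unnormalized log-posterior of each class for one module."""
        features = text_features(source)
        total_docs = sum(self.docs.values())
        scores = {}
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            for token, n in features.items():
                if token in self.vocab:  # tokens unseen in training are skipped
                    score += n * math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, source):
        scores = self.log_scores(source)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
modules = [
    "int safe = check(x);",
    "check(y); return safe;",
    "int tmp = hack(x);",
    "hack(y); goto fail;",
]
labels = ["clean", "clean", "faulty", "faulty"]

model = NaiveBayes().fit(modules, labels)
print(model.predict("hack(z);"))   # faulty
print(model.predict("check(z);"))  # clean
```

No hand-crafted metric is computed here: the only input is the raw token counts of each module, which is the property the abstract emphasizes. A real experiment would train on labelled modules from a project history rather than on four invented strings.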

Notes

  1. java weka.filters.unsupervised.attribute.StringToWordVector -C -W 5000

  2. http://www.eclipse.org/

Acknowledgements

The authors would like to thank the three anonymous reviewers and the editor for their insightful and useful suggestions and comments. We also thank the developers of the CRM114 classifier; without CRM114, this work could not have been conducted. We thank Tatsuya Miyake, Yoshiki Higo, and Katsuro Inoue, who implemented a software metrics measurement tool. Finally, we thank the developers of Eclipse, who have made the Eclipse repository available for research.

Author information

Correspondence to Osamu Mizuno.

Additional information

Editor: Claes Wohlin

About this article

Cite this article

Hata, H., Mizuno, O. & Kikuno, T. Fault-prone module detection using large-scale text features based on spam filtering. Empir Software Eng 15, 147–165 (2010). https://doi.org/10.1007/s10664-009-9117-9
