Fault-prone module detection using large-scale text features based on spam filtering

Hata, Hideaki; Mizuno, Osamu; Kikuno, Tohru

doi:10.1007/s10664-009-9117-9

Published: 12 September 2009

Empirical Software Engineering, Volume 15, pages 147–165 (2010)

Abstract

This paper proposes an approach to fault-prone module detection that uses large-scale text features and is inspired by spam filtering. The number of occurrences of each text feature in the source code of a module is counted and used as data for training detection models. We prepared a naive Bayes classifier and a logistic regression model as detection models. To show the effectiveness of our approaches, we conducted experiments with five open source projects and compared the results with those of a well-known metrics set; our approaches achieved higher detection results. The results imply that large-scale text features are useful in constructing practical detection models, and that measuring sophisticated metrics is not always necessary for detecting fault-prone modules.
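The idea of counting text features and feeding the counts to a naive Bayes classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenization rule, the class name `NaiveBayes`, the add-one smoothing, and the toy Java-like snippets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source text into identifiers, numbers, and single symbols.

    This tokenization scheme is an assumption, not the one used in the paper.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


class NaiveBayes:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def __init__(self):
        self.token_counts = {}          # label -> Counter of token occurrences
        self.module_counts = Counter()  # label -> number of training modules
        self.vocabulary = set()         # every token seen during training

    def train(self, source, label):
        counts = Counter(tokenize(source))
        self.token_counts.setdefault(label, Counter()).update(counts)
        self.module_counts[label] += 1
        self.vocabulary.update(counts)

    def log_score(self, source, label):
        # Log prior: fraction of training modules carrying this label.
        total_modules = sum(self.module_counts.values())
        score = math.log(self.module_counts[label] / total_modules)
        # Log likelihood: each token contributes once per occurrence.
        label_counts = self.token_counts[label]
        denominator = sum(label_counts.values()) + len(self.vocabulary)
        for token, n in Counter(tokenize(source)).items():
            score += n * math.log((label_counts[token] + 1) / denominator)
        return score

    def predict(self, source):
        return max(self.module_counts,
                   key=lambda label: self.log_score(source, label))


if __name__ == "__main__":
    # Hypothetical toy modules; real training data would come from a
    # project's revision history and bug reports.
    model = NaiveBayes()
    model.train("if (p == null) { return p.size(); }", "fault-prone")
    model.train("if (p != null) { return p.size(); }", "clean")
    print(model.predict("if (q == null) { return q.length(); }"))
```

No hand-crafted metric is computed here; the classifier sees only raw token counts, which is the property the abstract emphasizes.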

Notes

java weka.filters.unsupervised.attribute.StringToWordVector -C -W 5000
http://www.eclipse.org/
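
In the Weka command in the first note, `-C` requests word counts rather than boolean word presence, and `-W 5000` asks the filter to keep approximately 5000 word fields. A rough Python analogue of that kind of vocabulary-limited count vector is sketched below; the function names, the `\w+` tokenization, and the alphabetical tie-breaking are assumptions and do not reproduce Weka's exact behavior.

```python
import re
from collections import Counter


def build_vocabulary(documents, max_words):
    """Return the max_words most frequent words across all documents."""
    totals = Counter()
    for text in documents:
        totals.update(re.findall(r"\w+", text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]


def to_count_vector(text, vocabulary):
    """Map a text to occurrence counts over the fixed vocabulary."""
    counts = Counter(re.findall(r"\w+", text))
    return [counts[word] for word in vocabulary]
```

Words outside the retained vocabulary are simply dropped from the vector, which is what keeps the feature space bounded even for large code bases.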

Acknowledgements

The authors would like to express their thanks to the three anonymous reviewers and the editor for providing insightful and useful suggestions and comments. We also thank the developers of the CRM114 classifier. Without CRM114, this work could not have been conducted. We thank Tatsuya Miyake, Yoshiki Higo, and Katsuro Inoue, who implemented a software metrics measurement tool. Finally, the authors also wish to thank the developers of Eclipse, who have made the repository of Eclipse available for research.

Author information

Authors and Affiliations

Graduate School of Information Science and Technology, Kyoto Institute of Technology, Kyoto, Japan
Osamu Mizuno
Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
Hideaki Hata & Tohru Kikuno

Corresponding author

Correspondence to Osamu Mizuno.

Additional information

Editor: Claes Wohlin

About this article

Cite this article

Hata, H., Mizuno, O. & Kikuno, T. Fault-prone module detection using large-scale text features based on spam filtering. Empir Software Eng 15, 147–165 (2010). https://doi.org/10.1007/s10664-009-9117-9

Issue Date: April 2010