Abstract
The accuracy of machine learners is affected by the quality of the data on which they are induced. In this paper, the quality of the training dataset is improved by removing instances that the Partitioning Filter detects as noisy. The fit dataset is first split into subsets, and a different base learner is induced on each split. The predictions are combined such that an instance is identified as noisy if it is misclassified by a certain number of base learners. Two versions of the Partitioning Filter are used: the Multiple-Partitioning Filter and the Iterative-Partitioning Filter. The number of instances removed by the filters is tuned by the filter's voting scheme and the number of iterations. The primary aim of this study is to compare the predictive performance of the final models built on the filtered and the unfiltered training datasets. A case study of software measurement data from a high-assurance software project is performed. It is shown that models built on the filtered fit datasets and evaluated on a noisy test dataset generally perform better than those built on the noisy (unfiltered) fit dataset. However, the predictive performance of certain aggressive filters is affected by the presence of noise in the evaluation dataset.
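As a rough illustration of the voting scheme described above (not the authors' implementation), the filtering step can be sketched in Python. Simple one-feature threshold rules stand in for the real base learners (e.g., C4.5 or RIPPER); all names and parameters here are hypothetical:

```python
def train_threshold_learner(data):
    """Fit a one-feature threshold rule on (x, label) pairs with labels
    in {0, 1} -- a toy stand-in for a real base learner."""
    best_t, best_err = 0.0, float("inf")
    for t in sorted({x for x, _ in data}):
        err = sum((x >= t) != bool(y) for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: int(x >= t)

def partitioning_filter(data, n_parts=3, min_votes=2):
    """Split the fit data into n_parts subsets, induce one base learner
    per subset, and drop any instance misclassified by at least
    min_votes learners.  min_votes = n_parts corresponds to a consensus
    scheme; min_votes > n_parts // 2 to a majority scheme."""
    # Round-robin split; in practice the partitions would be drawn randomly.
    parts = [data[i::n_parts] for i in range(n_parts)]
    learners = [train_threshold_learner(p) for p in parts]
    return [(x, y) for x, y in data
            if sum(clf(x) != y for clf in learners) < min_votes]
```

The Iterative-Partitioning Filter would wrap a call like this in a loop, re-partitioning and re-filtering the surviving instances until the number removed per round falls below a stopping threshold; lowering `min_votes` or adding iterations makes the filter more aggressive.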
Khoshgoftaar, T.M., Rebours, P. Improving Software Quality Prediction by Noise Filtering Techniques. J Comput Sci Technol 22, 387–396 (2007). https://doi.org/10.1007/s11390-007-9054-2