The pairwise attribute noise detection algorithm

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise, or labeling errors. In contrast, limited research has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called the Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest neighbor, distance-based outlier detection technique (denoted DM) investigated in related literature. Since what constitutes noise is domain specific, our case studies use a software engineering expert to inspect the instances identified by the two approaches and determine whether they actually contain noise. It is shown that PANDA provides better noise detection performance than the DM algorithm.
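The abstract does not describe how either PANDA or DM is computed. As a generic illustration of the family that DM belongs to (nearest neighbor, distance-based outlier detection), the following minimal sketch ranks instances by the Euclidean distance to their k-th nearest neighbor. It is not PANDA and not necessarily the DM variant evaluated in the article; the function names, the choice of k, and the toy data are assumptions made for illustration only.

```python
import math


def kth_neighbor_distance(data, k=2):
    """For each instance, return the Euclidean distance to its k-th nearest neighbor.

    Generic distance-based outlier score (an assumption for illustration,
    not the article's exact DM procedure).
    """
    if k < 1 or k >= len(data):
        raise ValueError("k must satisfy 1 <= k < number of instances")
    scores = []
    for i, row in enumerate(data):
        # Distances from this instance to every other instance, ascending.
        dists = sorted(
            math.dist(row, other) for j, other in enumerate(data) if j != i
        )
        scores.append(dists[k - 1])
    return scores


def rank_suspects(data, k=2):
    """Return instance indices ordered from most to least outlying."""
    scores = kth_neighbor_distance(data, k)
    return sorted(range(len(data)), key=lambda i: scores[i], reverse=True)


# Hypothetical two-attribute measurements; the last row is deliberately far away.
modules = [(10.0, 2.0), (11.0, 2.5), (9.5, 1.8), (10.5, 2.2), (60.0, 30.0)]
ranking = rank_suspects(modules, k=2)
```

In a workflow like the one the abstract describes, the top-ranked instances would then be handed to a domain expert for inspection, since a large distance alone does not establish that an instance contains noise. In practice the attributes would usually be standardized first so that no single attribute dominates the distance.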

Author information

Corresponding author

Correspondence to Taghi M. Khoshgoftaar.

Additional information

Jason Van Hulse is a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. His research interests include data mining and knowledge discovery, machine learning, computational intelligence and statistics. He is a student member of the IEEE and IEEE Computer Society. He received the M.A. degree in mathematics from Stony Brook University in 2000, and is currently Director, Decision Science at First Data Corporation.

Taghi M. Khoshgoftaar is a professor at the Department of Computer Science and Engineering, Florida Atlantic University, and the director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these subjects. He has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the IEEE, the IEEE Computer Society, and IEEE Reliability Society. He served as the program chair and general chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. Also, he has served on technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal, and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems.

Haiying Huang received the M.S. degree in computer engineering from Florida Atlantic University, Boca Raton, Florida, USA, in 2002. She is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. Her research interests include software engineering, computational intelligence, data mining, software measurement, software reliability, and quality engineering.

About this article

Cite this article

Van Hulse, J.D., Khoshgoftaar, T.M. & Huang, H. The pairwise attribute noise detection algorithm. Knowl Inf Syst 11, 171–190 (2007). https://doi.org/10.1007/s10115-006-0022-x

  • DOI: https://doi.org/10.1007/s10115-006-0022-x
