The pairwise attribute noise detection algorithm

Van Hulse, Jason D.; Khoshgoftaar, Taghi M.; Huang, Haiying

doi:10.1007/s10115-006-0022-x


Regular Paper
Published: 08 April 2006

Knowledge and Information Systems, Volume 11, pages 171–190 (2007)


Abstract

Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise or labeling errors. In contrast, limited research work has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called the Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest-neighbor, distance-based outlier detection technique (denoted DM) investigated in related literature. Since what constitutes noise is domain specific, our case studies use a software engineering expert to inspect the instances identified by the two approaches to determine whether they actually contain noise. It is shown that PANDA provides better noise detection performance than the DM algorithm.



Author information

Authors and Affiliations

Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, 33431, USA
Jason D. Van Hulse, Taghi M. Khoshgoftaar & Haiying Huang


Corresponding author

Correspondence to Taghi M. Khoshgoftaar.

Additional information

Jason Van Hulse is a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. His research interests include data mining and knowledge discovery, machine learning, computational intelligence and statistics. He is a student member of the IEEE and IEEE Computer Society. He received the M.A. degree in mathematics from Stony Brook University in 2000, and is currently Director, Decision Science at First Data Corporation.

Taghi M. Khoshgoftaar is a professor at the Department of Computer Science and Engineering, Florida Atlantic University, and the director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these subjects. He has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the IEEE, the IEEE Computer Society, and IEEE Reliability Society. He served as the program chair and general chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. Also, he has served on technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal, and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems.

Haiying Huang received the M.S. degree in computer engineering from Florida Atlantic University, Boca Raton, Florida, USA, in 2002. She is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. Her research interests include software engineering, computational intelligence, data mining, software measurement, software reliability, and quality engineering.


About this article

Cite this article

Van Hulse, J.D., Khoshgoftaar, T.M. & Huang, H. The pairwise attribute noise detection algorithm. Knowl Inf Syst 11, 171–190 (2007). https://doi.org/10.1007/s10115-006-0022-x


Received: 26 April 2005
Revised: 19 September 2005
Accepted: 14 January 2006
Published: 08 April 2006
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10115-006-0022-x
