Effective and efficient feature selection for large-scale data using Bayes’ theorem

International Journal of Automation and Computing

Abstract

This paper proposes a feature selection method based on Bayes' theorem. Its purpose is to reduce computational complexity while increasing the classification accuracy of the selected feature subsets. The dependence between two binary attributes is determined from the probabilities of their joint values contributing to positive and negative classification decisions. If opposing sets of attribute values never lead to opposing classification decisions (zero probability), the two attributes are considered independent of each other; otherwise they are dependent, and one of them can be removed, reducing the number of attributes. The process is repeated over all pairs of attributes. The paper also evaluates the approach against existing feature selection algorithms on 8 datasets from the University of California, Irvine (UCI) machine learning databases. The proposed method outperforms most existing algorithms in terms of the number of selected features, classification accuracy, and running time.
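
The abstract gives only the outline of the dependence test, so the following Python sketch is an illustration of the idea rather than the authors' published algorithm. It assumes binary attributes and binary class labels, treats a pair of attributes as dependent when opposing joint values are observed to yield opposing classification decisions with nonzero probability, and greedily removes one attribute of each dependent pair. The function names and the choice of which attribute to drop are assumptions made here for illustration.

```python
# Illustrative sketch of the pairwise dependence test outlined in the
# abstract. Assumptions (not from the paper): binary attributes, binary
# class labels, and keeping the earlier attribute of a dependent pair.
import numpy as np

def looks_dependent(a, b, y):
    """True if opposing joint values of attributes a and b are observed to
    yield opposing classification decisions with nonzero probability."""
    for u in (0, 1):
        for v in (0, 1):
            m = (a == u) & (b == v)              # one joint value ...
            m_opp = (a == 1 - u) & (b == 1 - v)  # ... and its opposite
            if m.any() and m_opp.any():
                p_pos = y[m].mean()              # P(positive decision | u, v)
                p_neg = 1.0 - y[m_opp].mean()    # P(negative decision | opposite)
                if p_pos > 0 and p_neg > 0:      # opposing decisions do occur
                    return True
    return False                                 # zero probability: independent

def select_features(X, y):
    """Scan all attribute pairs; remove one attribute of each dependent pair."""
    keep = list(range(X.shape[1]))
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            if looks_dependent(X[:, keep[i]], X[:, keep[j]], y):
                del keep[j]                      # assumption: drop the later one
            else:
                j += 1
        i += 1
    return keep

# Toy usage: attribute 1 is a noisy copy of attribute 0, and the class
# label follows attribute 0, so one of the pair should be removed.
rng = np.random.default_rng(0)
a0 = rng.integers(0, 2, size=200)
a1 = (a0 ^ (rng.random(200) < 0.1)).astype(int)  # ~10% of values flipped
X = np.column_stack([a0, a1])
y = a0
print(select_features(X, y))                     # [0]
```

A faithful implementation would compute the probabilities exactly as the paper derives them from Bayes' theorem; the sketch only shows the shape of the procedure, whose scan over all attribute pairs is quadratic in the number of attributes.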

References

1. R. Agrawal, T. Imielinski, A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914–925, 1993.

2. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), pp. 495–515, AAAI Press/MIT Press, Menlo Park, CA, USA, 1996.

3. J. Han, Y. Fu. Attribute-oriented Induction in Data Mining. Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), pp. 399–421, AAAI Press/MIT Press, Menlo Park, CA, USA, 1996.

4. J. Han, M. Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.

5. H. Liu, H. Motoda. Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic, Boston, USA, 1998.

6. D. Pyle. Data Preparation for Data Mining, Morgan Kaufmann, 1999.

7. A. L. Blum, P. Langley. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, vol. 97, no. 1–2, pp. 245–271, 1997.

8. H. Liu, H. Motoda. Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic, Boston, USA, 1998, 2nd printing, 2001.

9. M. Ben-Bassat. Pattern Recognition and Reduction of Dimensionality. Handbook of Statistics II, P. R. Krishnaiah, L. N. Kanal (eds.), North Holland, pp. 773–791, 1982.

10. A. Jain, D. Zongker. Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153–158, 1997.

11. P. Mitra, C. A. Murthy, S. K. Pal. Unsupervised Feature Selection Using Feature Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.

12. W. Siedlecki, J. Sklansky. On Automatic Feature Selection. International Journal of Pattern Recognition and Artificial Intelligence, vol. 2, no. 2, pp. 197–220, 1988.

13. N. Wyse, R. Dubes, A. K. Jain. A Critical Evaluation of Intrinsic Dimensionality Algorithms. Pattern Recognition in Practice, E. S. Gelsema, L. N. Kanal (eds.), pp. 415–425, Morgan Kaufmann, 1980.

14. G. H. John, R. Kohavi, K. Pfleger. Irrelevant Features and the Subset Selection Problem. In Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, New Brunswick, New Jersey, USA, pp. 121–129, 1994.

15. K. Kira, L. A. Rendell. The Feature Selection Problem: Traditional Methods and a New Algorithm. In Proceedings of the 10th National Conference on Artificial Intelligence, MIT Press, San Jose, California, USA, pp. 129–134, 1992.

16. R. Kohavi, G. H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, vol. 97, no. 1–2, pp. 273–324, 1997.

17. M. Dash, K. Choi, P. Scheuermann, H. Liu. Feature Selection for Clustering — A Filter Solution. In Proceedings of the 2nd International Conference on Data Mining, IEEE Computer Society Press, Maebashi City, Japan, pp. 115–122, 2002.

18. M. Dash, H. Liu. Feature Selection for Classification. Intelligent Data Analysis, vol. 1, no. 3, pp. 131–156, 1997.

19. Y. Kim, W. N. Street, F. Menczer. Feature Selection for Unsupervised Learning via Evolutionary Search. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, Boston, MA, USA, pp. 365–369, 2000.

20. E. Leopold, J. Kindermann. Text Categorization with Support Vector Machines: How to Represent Texts in Input Space? Machine Learning, vol. 46, no. 1, pp. 423–444, 2002.

21. K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell. Text Classification from Labeled and Unlabeled Documents Using EM. Machine Learning, vol. 39, no. 2, pp. 103–134, 2000.

22. Y. Yang, J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, Nashville, Tennessee, USA, pp. 412–420, 1997.

23. Y. Rui, T. S. Huang, S. F. Chang. Image Retrieval: Current Techniques, Promising Directions and Open Issues. Journal of Visual Communication and Image Representation, vol. 10, no. 1, pp. 39–62, 1999.

24. D. L. Swets, J. J. Weng. Efficient Content-based Image Retrieval Using Automatic Feature Selection. In Proceedings of the IEEE International Symposium on Computer Vision, IEEE Computer Society Press, pp. 85–90, 1995.

25. K. S. Ng, H. Liu. Customer Retention via Data Mining. Artificial Intelligence Review, vol. 14, no. 6, pp. 569–590, 2000.

26. W. Lee, S. J. Stolfo, K. W. Mok. Adaptive Intrusion Detection: A Data Mining Approach. Artificial Intelligence Review, vol. 14, no. 6, pp. 533–567, 2000.

27. E. Xing, M. I. Jordan, R. M. Karp. Feature Selection for High-dimensional Genomic Microarray Data. In Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, Williamstown, MA, USA, pp. 601–608, 2001.

28. A. L. Blum, R. L. Rivest. Training a 3-Node Neural Network is NP-Complete. Neural Networks, vol. 5, no. 1, pp. 117–127, 1992.

29. P. Langley. Selection of Relevant Features in Machine Learning. In Proceedings of the AAAI Fall Symposium on Relevance, AAAI Press, Menlo Park, California, USA, pp. 140–144, 1994.

30. A. J. Miller. Subset Selection in Regression, 2nd Edition, Chapman & Hall/CRC, 2002.

31. T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Springer, 2001.

32. J. Doak. An Evaluation of Feature Selection Methods and Their Application to Computer Security, Technical Report, Department of Computer Science, University of California at Davis, USA, 1992.

33. M. Dash, H. Liu. Handling Large Unsupervised Data via Dimensionality Reduction. In Proceedings of the SIGMOD Research Issues in Data Mining and Knowledge Discovery Workshop, 1999.

34. M. Dash, H. Liu, J. Yao. Dimensionality Reduction of Unsupervised Data. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, IEEE Press, Newport Beach, CA, USA, pp. 532–539, 1997.

35. J. G. Dy, C. E. Brodley. Feature Subset Selection and Order Identification for Unsupervised Learning. In Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, USA, pp. 247–254, 2000.

36. L. Talavera. Feature Selection as a Preprocessing Step for Hierarchical Clustering. In Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, Bled, Slovenia, pp. 389–397, 1999.

37. M. A. Hall. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. In Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, Stanford University, USA, pp. 359–366, 2000.

38. H. Liu, R. Setiono. A Probabilistic Approach to Feature Selection — A Filter Solution. In Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, Bari, Italy, pp. 319–327, 1996.

39. L. Yu, H. Liu. Feature Selection for High-dimensional Data: A Fast Correlation-based Filter Solution. In Proceedings of the 20th International Conference on Machine Learning, AAAI Press, Washington DC, USA, pp. 856–863, 2003.

40. R. Caruana, D. Freitag. Greedy Attribute Selection. In Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, New Jersey, USA, pp. 28–36, 1994.

41. S. Das. Filters, Wrappers and a Boosting-based Hybrid for Feature Selection. In Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, Williams College, Williamstown, MA, USA, pp. 74–81, 2001.

42. A. Y. Ng. On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples. In Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, Madison, Wisconsin, USA, pp. 404–412, 1998.

43. J. R. Quinlan. Induction of Decision Trees. Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

44. J. R. Quinlan. C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, 1993.

45. L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen. Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.

46. R. S. Michalski. Pattern Recognition as Rule-guided Inductive Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 2, no. 4, pp. 349–361, 1980.

47. P. M. Narendra, K. Fukunaga. A Branch and Bound Algorithm for Feature Subset Selection. IEEE Transactions on Computers, vol. 26, no. 9, pp. 917–922, 1977.

48. P. Pudil, J. Novovicova, J. Kittler. Floating Search Methods in Feature Selection. Pattern Recognition Letters, vol. 15, no. 11, pp. 1119–1125, 1994.

49. P. Somol, P. Pudil, J. Kittler. Fast Branch and Bound Algorithms for Optimal Feature Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 7, pp. 900–912, 2004.

50. J. Casillas, O. Cordon, M. J. Del Jesus, F. Herrera. Genetic Feature Selection in a Fuzzy Rule-based Classification System Learning Process for High-dimensional Problems. Information Sciences, vol. 136, no. 1–4, pp. 135–157, 2001.

51. N. Xiong. A Hybrid Approach to Input Selection for Complex Processes. IEEE Transactions on Systems, Man, and Cybernetics — Part A, vol. 32, no. 4, pp. 532–536, 2002.

52. L. I. Kuncheva, J. C. Bezdek. Nearest Prototype Classification: Clustering, Genetic Algorithms or Random Search. IEEE Transactions on Systems, Man, and Cybernetics — Part C, vol. 28, no. 1, pp. 160–164, 1998.

53. S. Y. Ho, C. C. Liu, S. Liu. Design of an Optimal Nearest Neighbor Classifier Using an Intelligent Genetic Algorithm. Pattern Recognition Letters, vol. 23, no. 13, pp. 1495–1503, 2002.

54. R. Thawonmas, S. Abe. A Novel Approach to Feature Selection Based on Analysis of Class Regions. IEEE Transactions on Systems, Man, and Cybernetics — Part B, vol. 27, no. 2, pp. 196–207, 1997.

55. K. Kira, L. A. Rendell. A Practical Approach to Feature Selection. In Proceedings of the 9th International Conference on Machine Learning, Morgan Kaufmann, Aberdeen, Scotland, pp. 249–256, 1992.

56. I. Kononenko. Estimating Attributes: Analysis and Extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, Springer-Verlag, New York, USA, pp. 171–182, 1994.

57. S. Cost, S. Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, vol. 10, no. 1, pp. 57–78, 1993.

58. C. Stanfill, D. Waltz. Toward Memory-based Reasoning. Communications of the ACM, vol. 29, no. 12, pp. 1213–1228, 1986.

59. S. Zhao, E. C. C. Tsang. On Fuzzy Approximation Operators in Attribute Reduction with Fuzzy Rough Sets. Information Sciences, vol. 178, no. 16, pp. 3163–3176, 2008.

60. A. Sharma, K. K. Paliwal. Rotational Linear Discriminant Analysis Technique for Dimensionality Reduction. IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 10, pp. 1336–1347, 2008.

61. C. L. Blake, C. J. Merz. UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, USA, [Online], Available: http://www.ics.uci.edu/mlearn, 1998.

62. J. Joyce. Bayes' Theorem. Stanford Encyclopedia of Philosophy, E. N. Zalta (ed.), The Metaphysics Research Lab, Stanford University, USA, 2003.

Author information

Corresponding author

Correspondence to Subramanian Appavu Alias Balamurugan.

Additional information

Subramanian Appavu Alias Balamurugan is a Ph.D. candidate at the Department of Information and Communication Engineering, Anna University, Chennai, India. He is also a faculty member at Thiagarajar College of Engineering, Madurai, India.

His research interests include data mining and text mining.

Ramasamy Rajaram received the Ph.D. degree from Madurai Kamaraj University, India. He is a professor in the Department of Computer Science and Information Technology at Thiagarajar College of Engineering, Madurai, India.

His research interests include data mining and information security.

About this article

Cite this article

Balamurugan, S.A.A., Rajaram, R. Effective and efficient feature selection for large-scale data using Bayes’ theorem. Int. J. Autom. Comput. 6, 62–71 (2009). https://doi.org/10.1007/s11633-009-0062-2
