Abstract
This paper proposes a feature selection method based on Bayes' theorem. Its purpose is to reduce computational complexity while increasing the classification accuracy of the selected feature subsets. The dependence between two (binary) attributes is determined from the probabilities of their joint values contributing to positive and negative classification decisions. If opposing sets of attribute values never lead to opposing classification decisions (zero probability), the two attributes are considered independent of each other; otherwise they are dependent, and one of them can be removed, thereby reducing the number of attributes. The process is repeated over all pairs of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms on 8 datasets from the University of California, Irvine (UCI) machine learning repository. The proposed method outperforms most existing algorithms in terms of the number of selected features, classification accuracy, and running time.
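The pairwise dependence test described in the abstract can be sketched as follows. This is an illustrative interpretation under stated assumptions, not the authors' exact algorithm: attributes are assumed binary, and two attributes are flagged as dependent when opposing value pairs, (0, 1) versus (1, 0), are observed with opposing class decisions with non-zero probability. The function names are hypothetical.

```python
def are_dependent(X, y, i, j):
    """Sketch of the pairwise test: attributes i and j are dependent when
    opposing value pairs (0,1) and (1,0) co-occur with opposing class
    decisions with non-zero (empirical) probability."""
    # class labels observed alongside each opposing value pair
    classes_01 = {yk for xk, yk in zip(X, y) if (xk[i], xk[j]) == (0, 1)}
    classes_10 = {yk for xk, yk in zip(X, y) if (xk[i], xk[j]) == (1, 0)}
    # both opposing pairs occur, and between them they reach opposing
    # decisions => the attributes are dependent
    return bool(classes_01) and bool(classes_10) and len(classes_01 | classes_10) > 1

def select_features(X, y):
    """Repeat the test over all attribute pairs; drop one attribute of
    each dependent pair (a greedy sketch of the reduction step)."""
    n_attrs = len(X[0])
    keep = set(range(n_attrs))
    for i in range(n_attrs):
        for j in range(i + 1, n_attrs):
            if i in keep and j in keep and are_dependent(X, y, i, j):
                keep.discard(j)  # remove one of the dependent pair
    return sorted(keep)
```

For example, on a toy dataset where attribute 0 is the inverse of attribute 1 and both determine the class, the pair (0, 1) is flagged as dependent and attribute 1 is dropped.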
Author information
Subramanian Appavu Alias Balamurugan is a Ph.D. candidate at the Department of Information and Communication Engineering, Anna University, Chennai, India. He is also a faculty member at Thiagarajar College of Engineering, Madurai, India.
His research interests include data mining and text mining.
Ramasamy Rajaram received the Ph.D. degree from Madurai Kamaraj University, India. He is a professor in the Department of Computer Science and Information Technology at Thiagarajar College of Engineering, Madurai, India.
His research interests include data mining and information security.
Cite this article
Balamurugan, S.A.A., Rajaram, R. Effective and efficient feature selection for large-scale data using Bayes’ theorem. Int. J. Autom. Comput. 6, 62–71 (2009). https://doi.org/10.1007/s11633-009-0062-2