Abstract
This paper presents a novel multiclass feature selection algorithm based on weighted conditional entropy, also referred to as uncertainty. The goal of the proposed algorithm is to select a feature subset such that, for each sample, at least one feature in the selected subset has a low uncertainty score. Features are first quantized into bins. The proposed method then computes an uncertainty vector from the weighted conditional entropy; the lower the uncertainty score for a class, the better the separability of the samples in that class. Next, an iterative procedure selects one feature per iteration by (1) computing, for each candidate feature subset, the minimum uncertainty score of each sample, (2) averaging these minimum scores across all samples, and (3) selecting the feature that minimizes this mean of minimum uncertainty scores. Experimental results show that the proposed algorithm outperforms mRMR, achieving lower misclassification rates on various publicly available datasets. In most cases, the number of features needed to reach a specified misclassification error is smaller than that required by traditional methods. Across all datasets, the misclassification error is reduced by 5–25% on average compared to a traditional method.
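The selection loop described in the abstract can be sketched in code. The following is a minimal, hypothetical Python/NumPy illustration assembled from the abstract alone: `uncertainty_scores` stands in for the paper's weighted conditional entropy (the paper's exact weighting may differ), and `m3u_select` implements the greedy "minimum mean minimum uncertainty" step. All function and variable names here are illustrative, not the authors' implementation.

```python
import numpy as np

def uncertainty_scores(X_binned, y, n_classes):
    """Per-feature, per-class weighted conditional entropy H(class | feature bin).
    A lower score for a class suggests the feature separates that class well.
    (Sketched from the abstract; not the paper's exact formulation.)"""
    n_samples, n_features = X_binned.shape
    U = np.zeros((n_features, n_classes))
    for f in range(n_features):
        for b in np.unique(X_binned[:, f]):
            mask = X_binned[:, f] == b
            p_bin = mask.mean()                          # weight of this bin
            counts = np.bincount(y[mask], minlength=n_classes)
            p = counts / counts.sum()                    # class distribution in the bin
            with np.errstate(divide="ignore", invalid="ignore"):
                h = np.where(p > 0, -p * np.log2(p), 0.0)  # per-class entropy terms
            U[f] += p_bin * h                            # accumulate weighted entropy
    return U

def m3u_select(X_binned, y, n_select, n_classes):
    """Greedy M3U-style selection: each iteration adds the feature that
    minimizes the mean (over samples) of the minimum per-sample uncertainty."""
    U = uncertainty_scores(X_binned, y, n_classes)
    sample_u = U[:, y]            # (n_features, n_samples): score of each feature for each sample's class
    best_so_far = np.full(len(y), np.inf)
    selected, remaining = [], list(range(X_binned.shape[1]))
    for _ in range(n_select):
        # mean-of-min uncertainty if candidate f were added to the subset
        cand = [np.minimum(best_so_far, sample_u[f]).mean() for f in remaining]
        f = remaining[int(np.argmin(cand))]
        best_so_far = np.minimum(best_so_far, sample_u[f])
        selected.append(f)
        remaining.remove(f)
    return selected
```

On a toy dataset where feature 0 perfectly predicts the class and feature 1 is uninformative, the first feature selected is feature 0, matching the intuition that it has zero uncertainty for every sample.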
References
Allwein, E.L., Schapire, R.E., Singer, Y. (2000). Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1(Dec), 113–141.
Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L. (2013). A public domain dataset for human activity recognition using smartphones. In ESANN.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.
Bermingham, M.L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., Wright, A.F., Wilson, J.F., Agakov, F., Navarro, P., Haley, C.S. (2015). Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports, 5, 10312.
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A. (1984). Classification and regression trees. Boca Raton: CRC Press.
Brown, G. (2009). A new perspective for information theoretic feature selection. In AISTATS, pp. 49–56.
Chen, Y.W., & Lin, C.J. (2006). Combining SVMs with various feature selection strategies. Feature Extraction, 207, 315–324.
Cover, T.M., & Thomas, J.A. (2012). Elements of information theory. New York: Wiley.
Dash, M., & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1), 155–176.
Devijver, P.A., & Kittler, J. (1982). Pattern recognition: a statistical approach. New Jersey: Prentice hall.
Dietterich, T.G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.
Fleuret, F. (2004). Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5(Nov), 1531–1555.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(Mar), 1289–1305.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157–1182.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389–422.
Hall, M.A. (2000). Correlation-based feature selection of discrete and numeric class machine learning. In Proceedings of the 17th international conference on machine learning, pp. 359–366.
Henze, N., & Penrose, M.D. (1999). On the multivariate runs test. Annals of Statistics, pp. 290–298.
Hou, Y., Zhang, P., Yan, T., Li, W., Song, D. (2010). Beyond redundancies: a metric-invariant method for unsupervised feature selection. IEEE Transactions on Knowledge and Data Engineering, 22(3), 348–364.
Huang, N.E., Shen, Z., Long, S.R., Wu, M.C., Shih, H.H., Zheng, Q., Yen, N.C., Tung, C.C., Liu, H.H. (1998). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. In Proceedings of the Royal Society of London A: mathematical, physical and engineering sciences, vol. 454, pp. 903–995. The Royal Society.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An introduction to statistical learning, vol. 6. Berlin: Springer.
Jolliffe, I. (2002). Principal component analysis. Wiley Online Library.
Kariwala, V., Ye, L., Cao, Y. (2013). Branch and bound method for regression-based controlled variable selection. Computers and Chemical Engineering, 54, 1–7.
Kohavi, R., & John, G.H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1), 273–324.
Kwak, N., & Choi, C.H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13(1), 143–159.
Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings of the AAAI fall symposium on relevance, vol. 184, pp. 245–271.
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318), 399–402.
Lin, D., & Tang, X. (2006). Conditional infomax learning: an integrated framework for feature extraction and fusion. In European conference on computer vision, pp. 68–82. Springer.
Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining, vol. 454. Berlin: Springer.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), 491–502.
Maji, P., & Pal, S.K. (2010). Feature selection using f-information measures in fuzzy approximation spaces. IEEE Transactions on Knowledge and Data Engineering, 22(6), 854–867.
Otto. (2014). Otto group product classification challenge. https://www.kaggle.com/.
Paschke, F., Bayer, C., Bator, M., Mönks, U., Dicks, A., Enge-Rosenblatt, O., Lohweg, V. (2013). Sensorlose Zustandsüberwachung an Synchronmotoren [Sensorless condition monitoring of synchronous motors]. In Proceedings. 23. Workshop Computational Intelligence, Dortmund, 5.–6. December 2013, p. 211. KIT Scientific Publishing.
Peng, H., Long, F., Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Pudil, P., Novovičová, J., Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15(11), 1119–1125.
Qu, G., Hariri, S., Yousif, M. (2005). A new dependency and correlation analysis for features. IEEE Transactions on Knowledge and Data Engineering, 17(9), 1199–1207.
Reyes-Ortiz, J.L., Oneto, L., Ghio, A., Samá, A., Anguita, D., Parra, X. (2014). Human activity recognition on smartphones with awareness of basic activities and postural transitions. In International conference on artificial neural networks, pp. 177–184. Springer.
Reyes-Ortiz, J.L., Oneto, L., Samà, A., Parra, X., Anguita, D. (2016). Transition-aware human activity recognition using smartphones. Neurocomputing, 171, 754–767.
Sayood, K. (2012). Introduction to data compression. Burlington: Morgan Kaufmann.
Siedlecki, W., & Sklansky, J. (1988). On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2(02), 197–220.
Somol, P., Pudil, P., Kittler, J. (2004). Fast branch & bound algorithms for optimal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(7), 900–912.
Theodoridis, S., & Koutroumbas, K. (2008). Pattern recognition. Cambridge: Academic Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
UCI. (2014). Forest cover type prediction. https://www.kaggle.com/.
Vergara, J.R., & Estévez, P.A. (2014). A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1), 175–186.
Vidal-Naquet, M., & Ullman, S. (2003). Object recognition with informative features and linear classification. In ICCV, vol. 3, p. 281.
Wang, D., Nie, F., Huang, H. (2015). Feature selection via global redundancy minimization. IEEE Transactions on Knowledge and Data Engineering, 27(10), 2743–2755.
Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M. (2003). Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3(Mar), 1439–1461.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V. (2000). Feature selection for SVMs. In Advances in neural information processing systems.
Yang, H.H., & Moody, J.E. (1999). Data visualization and feature selection: new algorithms for nongaussian data. In NIPS, vol. 99, pp. 687–693. Citeseer.
Yang, S.H., & Hu, B.G. (2012). Discriminative feature selection by nonparametric bayes error minimization. IEEE Transactions on Knowledge and Data Engineering, 24(8), 1422–1434.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5(Oct), 1205–1224.
Cite this article
Zhang, Z., Parhi, K.K. M3U: Minimum Mean Minimum Uncertainty Feature Selection for Multiclass Classification. J Sign Process Syst 92, 9–22 (2020). https://doi.org/10.1007/s11265-019-1443-6