Abstract
We present two heuristics for feature selection, based on entropy and mutual-information criteria respectively. The mutual-information-based selection algorithm exploits the submodularity of its objective and retrieves near-optimal solutions with a theoretical lower-bound guarantee. We demonstrate that these heuristic-based methods can reduce the dimensionality of classification problems by filtering out half of the features while still improving classification accuracy. Experimental results also show that the mutual-information-based heuristic tends to work best with classifiers when selecting about half of the features, whereas the entropy-based heuristic helps most in the early stage of selection, when a relatively small percentage of features is chosen. We also demonstrate a notable application of feature selection to classification on a medical dataset, where it can potentially halve the cost of diabetes diagnosis.
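The greedy, mutual-information-driven selection described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm: it scores each feature individually by its empirical mutual information with the class label and greedily picks the top ones, rather than maximizing a joint conditional-gain objective. The function names (`mutual_information`, `greedy_mi_selection`) are hypothetical; the (1 − 1/e) near-optimality guarantee for greedy maximization of a monotone submodular set function is due to Nemhauser, Wolsey, and Fisher (1978).

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X; Y), in bits, for discrete arrays x, y."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability estimate
            p_x = np.mean(x == xv)                 # marginal of X
            p_y = np.mean(y == yv)                 # marginal of Y
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def greedy_mi_selection(X, y, k):
    """Greedily pick k feature columns of X by mutual information with label y.

    A sketch only: each feature is scored independently, so this reduces to
    ranking by I(feature; y). A faithful submodular-greedy variant would
    rescore remaining features by their conditional gain given the set
    already selected.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(k):
        best = max(remaining, key=lambda j: mutual_information(X[:, j], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice one would compare the classifier's accuracy on the selected subset against the full feature set, as the experiments in the paper do when pruning roughly half of the features.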
Acknowledgement
This work was generously supported by the following funds: Hainan University’s Scientific Research Start-Up Fund; Ministry of Education of China’s Scientific Research Fund for the Returned Overseas Chinese Scholars; Hainan Province Natural Science Fund No. 20156243; China’s Natural Science Fund Nos. 11401146, 11471135, 61462022, 61562017, 61562018, 61562019; Hainan Province’s Major Science and Technology Project Grant No. ZDKJ2016015; Hainan Province’s Key Research and Development Program Grant Nos. ZDYF2017010 and ZDYF2017128. This work was also supported by the State Key Laboratory of Marine Resource Utilization in the South China Sea, Hainan University.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd
About this paper
Cite this paper
Qi, Q., Li, N., Li, W. (2017). Exploration of Heuristic-Based Feature Selection on Classification Problems. In: Chen, G., Shen, H., Chen, M. (eds) Parallel Architecture, Algorithm and Programming. PAAP 2017. Communications in Computer and Information Science, vol 729. Springer, Singapore. https://doi.org/10.1007/978-981-10-6442-5_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6441-8
Online ISBN: 978-981-10-6442-5