Abstract
We present two heuristics for feature selection, based on entropy and mutual-information criteria respectively. The mutual-information-based selection algorithm exploits the submodularity of its objective and retrieves near-optimal solutions with a theoretical lower-bound guarantee. We demonstrate that these heuristic-based methods can reduce the dimensionality of classification problems by filtering out half of the features while still improving classification accuracy. Experimental results also show that the mutual-information-based heuristic tends to work best with classifiers when selecting about half of the features, whereas the entropy-based heuristic helps most in the early stage of selection, when a relatively small percentage of features is chosen. We also demonstrate a notable application of feature selection to classification on a medical dataset, where it can potentially halve the cost of diabetes diagnosis.
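The greedy, mutual-information-driven selection described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm: it scores each feature individually by its empirical mutual information with the class label and greedily picks the top ones, rather than maximizing a joint conditional-gain objective. The function names (`mutual_information`, `greedy_mi_selection`) are hypothetical; the (1 − 1/e) near-optimality guarantee for greedy maximization of a monotone submodular set function is due to Nemhauser, Wolsey, and Fisher (1978).

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X; Y), in bits, for discrete arrays x, y."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability estimate
            p_x = np.mean(x == xv)                 # marginal of X
            p_y = np.mean(y == yv)                 # marginal of Y
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def greedy_mi_selection(X, y, k):
    """Greedily pick k feature columns of X by mutual information with label y.

    A sketch only: each feature is scored independently, so this reduces to
    ranking by I(feature; y). A faithful submodular-greedy variant would
    rescore remaining features by their conditional gain given the set
    already selected.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(k):
        best = max(remaining, key=lambda j: mutual_information(X[:, j], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice one would compare the classifier's accuracy on the selected subset against the full feature set, as the experiments in the paper do when pruning roughly half of the features.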
Acknowledgement
This work was generously supported by the following funds: Hainan University’s Scientific Research Start-Up Fund; Ministry of Education of China’s Scientific Research Fund for the Returned Overseas Chinese Scholars; Hainan Province Natural Science Fund No. 20156243; China’s Natural Science Fund Nos. 11401146, 11471135, 61462022, 61562017, 61562018, 61562019; Hainan Province’s Major Science and Technology Project Grant No. ZDKJ2016015; Hainan Province’s Key Research and Development Program Grant Nos. ZDYF2017010 and ZDYF2017128. This work was also supported by the State Key Laboratory of Marine Resource Utilization in the South China Sea, Hainan University.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd
About this paper
Cite this paper
Qi, Q., Li, N., Li, W. (2017). Exploration of Heuristic-Based Feature Selection on Classification Problems. In: Chen, G., Shen, H., Chen, M. (eds) Parallel Architecture, Algorithm and Programming. PAAP 2017. Communications in Computer and Information Science, vol 729. Springer, Singapore. https://doi.org/10.1007/978-981-10-6442-5_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6441-8
Online ISBN: 978-981-10-6442-5