Abstract
In many real-world applications of neural networks and other machine learning approaches, large experimental datasets are available that contain a huge number of variables whose effect on the considered system or phenomenon is not completely known or not deeply understood. Variable selection procedures identify a small subset of the original feature space in order to point out the input variables that mainly affect the considered target. Identifying such variables brings very important advantages, such as lower complexity of the model and of the learning algorithm, savings in computational time and improved performance. Moreover, variable selection procedures can help to acquire a deeper knowledge of the considered problem, system or phenomenon by identifying the factors that most affect it. This concept is strictly linked to the crucial aspect of the stability of the variable selection, defined as the sensitivity of a machine learning model with respect to variations in the dataset exploited in its training phase. In the present review, different categories of variable selection procedures are presented and discussed, in order to highlight the strengths and weaknesses of each method in relation to the different tasks and to the variables of the considered dataset.
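The two central ideas of the abstract, a filter-style selection of the most relevant input variables and the stability of that selection under perturbations of the training data, can be illustrated with a minimal sketch. This is not the authors' method: it assumes a simple correlation-based filter as the selector and measures stability as the mean pairwise Jaccard similarity of the subsets selected on bootstrap resamples; the function names (`filter_select`, `selection_stability`) are illustrative.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / ((vx * vy) ** 0.5) if vx and vy else 0.0

def filter_select(X, y, k):
    """Filter method: rank features by |correlation with target|, keep top k."""
    scores = [(abs(pearson([row[j] for row in X], y)), j)
              for j in range(len(X[0]))]
    return {j for _, j in sorted(scores, reverse=True)[:k]}

def jaccard(a, b):
    """Overlap between two selected subsets (1.0 = identical)."""
    return len(a & b) / len(a | b)

def selection_stability(X, y, k, n_resamples=20, seed=0):
    """Stability: mean pairwise Jaccard similarity of the subsets
    selected on bootstrap resamples of the training data."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        subsets.append(filter_select([X[i] for i in idx],
                                     [y[i] for i in idx], k))
    sims = [jaccard(subsets[i], subsets[j])
            for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return sum(sims) / len(sims)
```

A stability value close to 1 means the selector picks nearly the same variables regardless of which sample of the data it is trained on, which is the sensitivity notion discussed above.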
© 2016 Springer International Publishing Switzerland
Cite this paper
Cateni, S., Colla, V. (2016). Variable Selection for Efficient Design of Machine Learning-Based Models: Efficient Approaches for Industrial Applications. In: Jayne, C., Iliadis, L. (eds) Engineering Applications of Neural Networks. EANN 2016. Communications in Computer and Information Science, vol 629. Springer, Cham. https://doi.org/10.1007/978-3-319-44188-7_27
Print ISBN: 978-3-319-44187-0
Online ISBN: 978-3-319-44188-7