Abstract
Because high-dimensional datasets are now ubiquitous, feature selection and ranking are important steps in data preprocessing. In this work, we propose three transformations for real-valued features, each based on estimating the probability densities of the features. The transformations originated as modified distance measures for the ReliefF algorithm, which is one of the most prominent feature ranking algorithms. To enable comparison with other feature ranking algorithms, we present data transformations that are mathematically equivalent to the modified distance measures. Finally, we evaluate the proposed transformations in combination with several feature ranking methods on a set of benchmark datasets.
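The equivalence between a modified distance measure and a data transformation can be illustrated with a minimal sketch. The abstract does not specify the three transformations, so the empirical cumulative distribution function used here is only an assumed stand-in for a density-based transformation, and the names `make_ecdf`, `modified_distance`, and `plain_distance` are hypothetical. The sketch shows that measuring distance as |T(x) - T(y)| on the raw feature gives the same values as applying T to the data first and then using the ordinary absolute difference, which is what lets the transformed data be passed to any feature ranking method.

```python
from bisect import bisect_right


def make_ecdf(values):
    """Return T(x) = fraction of the given values that are <= x.

    Assumption: the empirical CDF stands in for a density-based
    transformation; it is not taken from the paper.
    """
    ordered = sorted(values)
    n = len(ordered)

    def transform(x):
        return bisect_right(ordered, x) / n

    return transform


def modified_distance(x, y, transform):
    """Distance on raw values, measured in the transformed space."""
    return abs(transform(x) - transform(y))


def plain_distance(x, y):
    """Ordinary absolute difference (range normalisation omitted)."""
    return abs(x - y)


# A skewed toy feature: four small values and two large outliers.
feature = [0.1, 0.2, 0.25, 0.3, 5.0, 40.0]
T = make_ecdf(feature)

# Transform the data once, up front.
transformed = [T(v) for v in feature]

# The modified distance on raw data matches the plain distance on
# transformed data for every pair of examples.
pairs_agree = all(
    modified_distance(feature[i], feature[j], T)
    == plain_distance(transformed[i], transformed[j])
    for i in range(len(feature))
    for j in range(len(feature))
)
print(pairs_agree)
```

Because the transformation is applied to the data rather than built into the distance, the same preprocessed features can be handed to rankers that never compute pairwise distances at all.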
Notes
1. Due to its implementation in Weka, the SVM-RFE algorithm could not be applied to datasets with non-binary nominal features; hence, the results of SVM-RFE are based on 19 (and not 28) small datasets. From now on, we refer to SVM-RFE as SVM.
References
Visualization-based cancer microarray data classification analysis. http://www.biolab.si/supp/bi-cancer/projections. Accessed 04 Oct 2015
Abramowitz, M., Stegun, I.: Handbook of Mathematical Functions (1972)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: A review of microarray datasets and applied feature selection methods. Inf. Sci. 282, 115–135 (2014)
Botev, Z., Grotowski, J., Kroese, D.P.: Kernel density estimation via diffusion. Ann. Stat. 38(5), 2916–2957 (2010)
Bowling, S.R., Khasawneh, M.T., Kaewkuekool, S., Cho, B.R.: A logistic approximation to the cumulative normal distribution. J. Ind. Eng. Manag. 2, 114–127 (2009)
Cantelli, F.P.: Sulla determinazione empirica delle leggi di probabilita. Giornale dell’Istituto Italiano degli Attuari 4, 421–424 (1933)
Cao, X.H., Obradovic, Z.: A robust data scaling algorithm for gene expression classification. In: Proceedings of the 15th IEEE International Conference on Bioinformatics and Bioengineering (2015)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B (Methodol.) 39, 1–38 (1977)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Glivenko, V.I.: Sulla determinazione empirica delle leggi di probabilita. Giornale dell’Istituto Italiano degli Attuari 2, 92–99 (1933)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1), 389–422 (2002)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2011)
Kira, K., Rendell, L.A.: The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI 1992, pp. 129–134 (1992)
Kononenko, I., Robnik-Šikonja, M.: Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. J. 53, 23–69 (2003)
Lewis, A.: Getdist. https://github.com/cmbant/getdist. Accessed 27 May 2016
Lichman, M.: UCI machine learning repository (2013)
Petković, M., Panov, P., Džeroski, S.: Improved ranking of numeric features with ReliefF. Presented at the Workshops on Machine Learning in Computational Biology (MLCB) & Machine Learning in Systems Biology (MLSB) (2015)
Rao, K.R., Kim, D.N., Hwang, J.J.: Fast Fourier Transform - Algorithms and Applications, 1st edn. Springer Publishing Company, Incorporated, Heidelberg (2010)
Slavkov, I.: An Evaluation Method for Feature Rankings. Ph.D. thesis, Mednarodna podiplomska šola Jožefa Stefana, Ljubljana (2012)
Stańczyk, U., Jain, L.C. (eds.): Feature Selection for Data and Pattern Recognition. Studies in Computational Intelligence, vol. 584. Springer, Heidelberg (2015). doi:10.1007/978-3-662-45620-0
Wu, C.: On the convergence properties of the EM algorithm. Ann. Stat. 11, 95–103 (1983)
Acknowledgements
We would like to acknowledge the support of the EC through the projects: MAESTRA (FP7-ICT-612944) and HBP (FP7-ICT-604102), and the Slovenian Research Agency through a young researcher grant and the program Knowledge Technologies (P2-0103).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Petković, M., Panov, P., Džeroski, S. (2016). A Comparison of Different Data Transformation Approaches in the Feature Ranking Context. In: Calders, T., Ceci, M., Malerba, D. (eds) Discovery Science. DS 2016. Lecture Notes in Computer Science, vol 9956. Springer, Cham. https://doi.org/10.1007/978-3-319-46307-0_20
DOI: https://doi.org/10.1007/978-3-319-46307-0_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46306-3
Online ISBN: 978-3-319-46307-0
eBook Packages: Computer Science (R0)