Abstract
Normalization methods are widely employed to transform the variables or features of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM) and Median Absolute Deviation (MAD), are studied on several datasets from the UCI repository. An exhaustive analysis of the transformed features' ranges and of their influence on the Euclidean distance is performed, concluding that knowledge of the group structure captured by each feature is needed to select the best normalization method for a given dataset. In order to effectively capture each feature's importance and adjust its contribution, this paper proposes a two-stage methodology for normalization and supervised feature weighting, based on the Pearson correlation coefficient and on a Random Forest feature importance estimation method. Simulations on five different datasets reveal that, in terms of accuracy, the proposed two-stage methodology outperforms, or at least matches, the K-means performance obtained when only normalization is applied.
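The pipeline described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' exact procedure: it implements the three classical normalizations (St, MM, MAD), then the two-stage idea using Random Forest feature importances as supervised weights (the paper also considers a Pearson-correlation-based variant); the specific dataset (iris), number of trees, and the choice to rescale features by their weights before running K-means are assumptions for the sketch. Note that multiplying features by weights induces a weighted Euclidean distance (with squared weights).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

def standardize(X):
    # Standardization (St): zero mean, unit variance per feature
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # Min-Max (MM): rescale each feature to the [0, 1] range
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def mad_normalize(X):
    # Median Absolute Deviation (MAD): robust analogue of St,
    # centering on the median and scaling by the MAD
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / mad

# Stage 1: normalization (Min-Max chosen here for illustration)
X, y = load_iris(return_X_y=True)
Xn = min_max(X)

# Stage 2: supervised feature weighting via Random Forest importances
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xn, y)
w = rf.feature_importances_  # non-negative, sums to 1
Xw = Xn * w                  # weighted features for the distance computation

# K-means on the normalized, weighted features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xw)
```

On datasets where some features carry little group-structure information, the weighting shrinks their contribution to the Euclidean distance, which is the mechanism the paper exploits to improve K-means accuracy.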
Acknowledgement
This work has been supported in part by the ELKARTEK program (SeNDANEU KK-2018/00032) and the HAZITEK program (DATALYSE ZL-2018/00765) of the Basque Government, and by a TECNALIA Research and Innovation PhD Scholarship.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Niño-Adan, I., Landa-Torres, I., Portillo, E., Manjarres, D. (2020). Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds) 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019). SOCO 2019. Advances in Intelligent Systems and Computing, vol 950. Springer, Cham. https://doi.org/10.1007/978-3-030-20055-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20054-1
Online ISBN: 978-3-030-20055-8
eBook Packages: Intelligent Technologies and Robotics (R0)