Abstract
Normalization methods are widely employed to transform the variables or features of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM) and Median Absolute Deviation (MAD), are studied on several datasets from the UCI repository. An exhaustive analysis of the transformed features' ranges and of their influence on the Euclidean distance is performed, concluding that knowledge of the group structure captured by each feature is needed to select the best normalization method for a given dataset. In order to effectively capture each feature's importance and adjust its contribution, this paper proposes a two-stage methodology for normalization and supervised feature weighting, based on the Pearson correlation coefficient and on a Random Forest feature importance estimation method. Simulations on five different datasets reveal that, in terms of accuracy, the proposed two-stage methodology outperforms, or at least matches, the K-means performance obtained when only normalization is applied.
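The pipeline described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' exact procedure: it implements the three classical normalizations (St, MM, MAD), then the two-stage idea using Random Forest feature importances as supervised weights (the paper also considers a Pearson-correlation-based variant); the specific dataset (iris), number of trees, and the choice to rescale features by their weights before running K-means are assumptions for the sketch. Note that multiplying features by weights induces a weighted Euclidean distance (with squared weights).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

def standardize(X):
    # Standardization (St): zero mean, unit variance per feature
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # Min-Max (MM): rescale each feature to the [0, 1] range
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def mad_normalize(X):
    # Median Absolute Deviation (MAD): robust analogue of St,
    # centering on the median and scaling by the MAD
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / mad

# Stage 1: normalization (Min-Max chosen here for illustration)
X, y = load_iris(return_X_y=True)
Xn = min_max(X)

# Stage 2: supervised feature weighting via Random Forest importances
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xn, y)
w = rf.feature_importances_  # non-negative, sums to 1
Xw = Xn * w                  # weighted features for the distance computation

# K-means on the normalized, weighted features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xw)
```

On datasets where some features carry little group-structure information, the weighting shrinks their contribution to the Euclidean distance, which is the mechanism the paper exploits to improve K-means accuracy.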
Acknowledgement
This work has been supported in part by the ELKARTEK program (SeNDANEU KK-2018/00032) and the HAZITEK program (DATALYSE ZL-2018/00765) of the Basque Government, and by a TECNALIA Research and Innovation PhD Scholarship.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Niño-Adan, I., Landa-Torres, I., Portillo, E., Manjarres, D. (2020). Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds) 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019). SOCO 2019. Advances in Intelligent Systems and Computing, vol 950. Springer, Cham. https://doi.org/10.1007/978-3-030-20055-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20054-1
Online ISBN: 978-3-030-20055-8
eBook Packages: Intelligent Technologies and Robotics (R0)