Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy

  • Conference paper
  • Published in: 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019)

Abstract

Normalization methods are widely employed to transform the variables or features of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM), and Median Absolute Deviation (MAD), are studied on different synthetic datasets from the UCI repository. An exhaustive analysis of the transformed features’ ranges and their influence on the Euclidean distance is performed, leading to the conclusion that knowledge about the group structure gathered by each feature is needed to select the best normalization method for a given dataset. In order to effectively capture the features’ importance and adjust their contribution accordingly, this paper proposes a two-stage methodology for normalization and supervised feature weighting based on the Pearson correlation coefficient and on a Random Forest feature-importance estimation method. Simulations on five different datasets reveal that, in terms of accuracy, the proposed two-stage methodology outperforms or at least matches the K-means performance obtained when only normalization is applied.
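
As a point of reference for readers without access to the full text, the three normalization methods named in the abstract can be sketched as follows, assuming their standard per-feature definitions; this is an illustrative sketch, not the authors' code:

    import numpy as np

    def standardize(X):
        # Standardization (St): zero mean and unit variance per feature.
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def min_max(X):
        # Min-Max (MM): rescale each feature to the [0, 1] range.
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / (maxs - mins)

    def mad(X):
        # Median Absolute Deviation (MAD): robust analogue of St,
        # centring on the median and scaling by the MAD.
        med = np.median(X, axis=0)
        return (X - med) / np.median(np.abs(X - med), axis=0)

Each method yields a different transformed range, and therefore a different per-feature contribution to the Euclidean distance that K-means minimises, which is the effect the paper analyses.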
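
Likewise, here is a rough sketch of the two-stage pipeline described in the abstract: normalize first, then weight features with a supervised importance estimate before running K-means. The helper names, the exact weight definitions, and the square-root rescaling are assumptions made here for illustration; the paper's published scheme may differ:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier

    def pearson_weights(X, y):
        # Stage 2, option A: absolute Pearson correlation of each feature
        # with the class label, normalized to sum to 1 (assumed scheme).
        w = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                      for j in range(X.shape[1])])
        return w / w.sum()

    def rf_weights(X, y, seed=0):
        # Stage 2, option B: Random Forest feature-importance estimates.
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        return rf.fit(X, y).feature_importances_

    def two_stage_kmeans(X, y, k, normalize, weight_fn, seed=0):
        # Stage 1: normalize; Stage 2: weight features, then cluster.
        # Multiplying feature j by sqrt(w_j) weights its squared
        # contribution to the Euclidean distance by w_j.
        Xn = normalize(X)
        w = weight_fn(Xn, y)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        return km.fit_predict(Xn * np.sqrt(w))

A usage example, reusing standardize from the previous sketch: labels = two_stage_kmeans(X, y, k=3, normalize=standardize, weight_fn=rf_weights). Note that the class labels are used only to estimate feature relevance; the clustering step itself remains standard, unsupervised K-means, consistent with the abstract's framing.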

Acknowledgement

This work has been supported in part by the ELKARTEK program (SeNDANEU KK-2018/00032) and the HAZITEK program (DATALYSE ZL-2018/00765) of the Basque Government, and by a TECNALIA Research and Innovation PhD Scholarship.

Author information

Corresponding author

Correspondence to Iratxe Niño-Adan.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Niño-Adan, I., Landa-Torres, I., Portillo, E., Manjarres, D. (2020). Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds) 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019). SOCO 2019. Advances in Intelligent Systems and Computing, vol 950. Springer, Cham. https://doi.org/10.1007/978-3-030-20055-8_2
