Optimization of Data Preprocessing Procedure in the Systems of High Dimensional Data Clustering

  • Conference paper
Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making (ISDMCI 2022)

Abstract

In this paper, we propose a technique for choosing the optimal normalization method for high-dimensional data at the data preprocessing stage. It is well known that the quality of data preprocessing significantly influences subsequent processing steps such as classification, clustering, and forecasting. In our research, we used both the Shannon entropy and the relative ratio of Shannon entropy as the main criteria for evaluating normalization quality. Before applying cluster analysis, we reduced the data dimensionality using principal component analysis. The data were then clustered with the fuzzy C-means algorithm, and the clustering quality was evaluated for each normalization method. The analysis of the simulation results allows us to conclude that, for this type of data (gene expression profiles), the decimal scaling method is optimal, since the Shannon entropy of the investigated data reaches its minimal value compared with the other normalization methods. Moreover, the relative ratio of Shannon entropy does not exceed the permissible limits during dimensionality reduction by principal component analysis.
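As a rough illustration of two ingredients named in the abstract, the sketch below shows decimal scaling normalization and a histogram-based estimate of Shannon entropy. It is not the authors' implementation: the function names, the restriction to a non-negative scaling exponent, the bin count, and the choice of a histogram estimator are all assumptions made here for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def decimal_scaling(values):
    """Scale values by 10**j, with j the smallest non-negative integer
    such that max(|v|) / 10**j < 1 (textbook decimal scaling; assumed here)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def shannon_entropy(values, bins=10):
    """Histogram estimate of Shannon entropy in bits.

    The equal-width binning over [min, max] is an assumption of this sketch,
    not something stated in the paper's abstract.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: a single bin, zero entropy
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical expression values, used only to exercise the functions.
profile = [-991, 45, 986, 120, 300]
scaled = decimal_scaling(profile)
entropy_bits = shannon_entropy(scaled, bins=4)
```

Note that an estimator binned over the data's own range, as above, is unchanged by any linear rescaling, so comparing normalization methods in practice would need an entropy estimate that is sensitive to the transformation; the abstract does not say which one the authors used.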

Corresponding author

Correspondence to Maksym Korobchynskyi.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Korobchynskyi, M., Rudenko, M., Dereko, V., Kovtun, O., Zaitsev, O. (2023). Optimization of Data Preprocessing Procedure in the Systems of High Dimensional Data Clustering. In: Babichev, S., Lytvynenko, V. (eds) Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making. ISDMCI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 149. Springer, Cham. https://doi.org/10.1007/978-3-031-16203-9_26
