Abstract
In this paper, we propose a technique for choosing the optimal normalization method for high-dimensional data at the data preprocessing stage. It is well known that the quality of data preprocessing significantly influences subsequent processing steps such as classification, clustering, and forecasting. In our research, we used both the Shannon entropy and the relative ratio of Shannon entropy as the main criteria for evaluating normalization quality. Before applying cluster analysis, we reduced the data dimensionality using principal component analysis. The data were then clustered with the fuzzy C-means algorithm, and the clustering quality was evaluated for each of the normalization methods. The analysis of the simulation results allows us to conclude that, for this type of data (gene expression profiles), the decimal scaling method is optimal, since the Shannon entropy of the investigated data reaches its minimal value compared with the other normalization methods. Moreover, the relative ratio of Shannon entropy does not exceed the permissible limits during dimensionality reduction by principal component analysis.
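The entropy-based comparison of normalization methods described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the synthetic data, the three candidate methods, and in particular the entropy estimator (a histogram with a fixed bin width of 0.05), since the abstract does not say how the Shannon entropy is computed. The principal component analysis and fuzzy C-means steps are omitted.

```python
import math
import random
from collections import Counter


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def min_max(values):
    """Rescale linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def shannon_entropy(values, bin_width=0.05):
    """Histogram estimate of Shannon entropy in bits.

    Assumption: a fixed bin width is used so that results for differently
    scaled data are comparable; the source does not specify an estimator.
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Synthetic stand-in for one expression profile (hypothetical data).
    random.seed(0)
    profile = [random.lognormvariate(5.0, 1.0) for _ in range(1000)]

    methods = {
        "decimal scaling": decimal_scaling,
        "min-max": min_max,
        "z-score": z_score,
    }
    for name, normalize in methods.items():
        entropy = shannon_entropy(normalize(profile))
        print(f"{name}: {entropy:.3f} bits")
```

Under this criterion, the method that yields the lowest entropy would be preferred. The ranking produced by the sketch depends on the chosen estimator and bin width, so it should not be read as a reproduction of the reported result.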
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Korobchynskyi, M., Rudenko, M., Dereko, V., Kovtun, O., Zaitsev, O. (2023). Optimization of Data Preprocessing Procedure in the Systems of High Dimensional Data Clustering. In: Babichev, S., Lytvynenko, V. (eds) Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making. ISDMCI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 149. Springer, Cham. https://doi.org/10.1007/978-3-031-16203-9_26
DOI: https://doi.org/10.1007/978-3-031-16203-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16202-2
Online ISBN: 978-3-031-16203-9
eBook Packages: Intelligent Technologies and Robotics (R0)