Abstract
Microarray data analysis demands great care because it plays a significant role in cancer studies. Owing to the complexity of the data extraction process, some relevant information is lost (missing values), which causes a significant and irrecoverable departure from the actual scenario. Imputation of missing values is therefore a crucial preprocessing step in analyzing microarray data. Numerous methodologies have been designed to resolve this problem, but they give unsatisfactory results when the missing rate is high. To estimate the missing expression values and complete the dataset, a novel method based on a similarity index and a generative adversarial network (Sim-GAN) is proposed. First, the raw dataset is divided into two subsets: the target set (genes with missing expression values) and the candidate set (genes without missing values). Next, the similarity index between target genes and candidate genes is computed. As microarray data reflect several biological factors, three similarity matrices (structural similarity, functional similarity, and semantic similarity) are derived to find a small subset of candidate genes for each target gene. For structural similarity, a novel approach is used that reduces the time complexity to O(1) and also handles nonlinearity. The resulting subsets are then fed into a generative adversarial network to compute the missing values of the target genes. The experimental outcomes support the claim that the proposed methodology performs satisfactorily in producing meaningful expression values. A detailed comparative study based on several statistical (e.g., NRMSE, AUROC) and biological (e.g., CPP, BLCI) metrics confirms that the proposed Sim-GAN outperforms existing missing value estimation techniques.
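The first two steps described above (splitting genes into target and candidate sets, then choosing a small candidate subset per target gene) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the structural, functional, or semantic similarity measures, so absolute Pearson correlation over the observed columns is used here purely as a stand-in, and the GAN stage is omitted. The function names `split_genes`, `top_k_candidates`, and `nrmse` are hypothetical, and the NRMSE shown is one common definition (RMSE divided by the standard deviation of the true values), which may differ from the one used in the paper.

```python
import numpy as np


def split_genes(expr):
    """Return (target_idx, candidate_idx) for a genes x samples matrix.

    Target genes have at least one missing (NaN) value; candidate genes
    are complete.
    """
    has_missing = np.isnan(expr).any(axis=1)
    return np.where(has_missing)[0], np.where(~has_missing)[0]


def top_k_candidates(expr, target_idx, candidate_idx, k=2):
    """Pick the k most similar candidate genes for each target gene.

    ASSUMPTION: similarity is |Pearson r| on the columns observed in the
    target gene. This is a placeholder for the paper's three measures.
    A constant row would give an undefined correlation (NaN).
    """
    selected = {}
    for t in target_idx:
        observed = ~np.isnan(expr[t])
        scores = [
            abs(np.corrcoef(expr[t, observed], expr[c, observed])[0, 1])
            for c in candidate_idx
        ]
        order = np.argsort(scores)[::-1][:k]  # highest similarity first
        selected[int(t)] = [int(candidate_idx[i]) for i in order]
    return selected


def nrmse(true_vals, imputed_vals):
    """RMSE normalized by the (population) standard deviation of the truth."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((imputed_vals - true_vals) ** 2))
    return float(rmse / np.std(true_vals))


# Toy matrix: 4 genes x 4 samples, gene 1 has one missing value.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, np.nan, 8.0],
    [4.0, 3.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 5.0],
])
targets, candidates = split_genes(expr)
subsets = top_k_candidates(expr, targets, candidates, k=2)
```

In this sketch, `subsets` maps each target gene index to the indices of its selected candidate genes; in the described method, subsets of this kind are what would be passed on to the generative adversarial network.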









Acknowledgements
The authors would like to thank anonymous reviewers for their valuable comments.
Ethics declarations
Conflict of interests
The authors declare that there are no conflicts of interest in this paper.
Ethical approval
This article does not contain any studies with human participants performed by any authors.
About this article
Cite this article
Pati, S.K., Gupta, M.K., Shai, R. et al. Missing value estimation of microarray data using Sim-GAN. Knowl Inf Syst 64, 2661–2687 (2022). https://doi.org/10.1007/s10115-022-01718-0
DOI: https://doi.org/10.1007/s10115-022-01718-0