KGA: integrating KPCA and GAN for microbial data augmentation

Wen, Liu-Ying; Zhang, Xiao-Min; Li, Qing-Feng; Min, Fan

doi:10.1007/s13042-022-01707-3

KGA: integrating KPCA and GAN for microbial data augmentation

Original Article
Published: 06 November 2022

Volume 14, pages 1427–1444, (2023)
Cite this article

International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Liu-Ying Wen¹,
Xiao-Min Zhang¹,
Qing-Feng Li² &
…
Fan Min ORCID: orcid.org/0000-0002-3290-1036^1,3

430 Accesses
7 Citations
Explore all metrics

Abstract

The data used for microbial-based disease diagnosis are characterized by small sample sizes, imbalanced categories, high dimensionality, and strong sparsity. They pose challenges to machine learning algorithms that aim to achieve good classification performance. In this paper, we propose a two-stage data augmentation method to enhance training data quality. The first stage is feature transformation. We design a KPCA-based method to map microbial data to a low-rank feature space, resulting in cleaner and more efficient data representation. This processing step addresses high dimensionality and strong sparsity in microbial data. The second stage is data augmentation. New synthetic data are obtained by augmenting the positive samples through the GAN. The misclassification cost is used to control the ratio of positive/negative samples in new data. The combination of the augmented data with the original data constitutes a cost-sensitive dataset, which can increase sample diversity while addressing the imbalance problem. This is more reasonable than traditional sampling methods that resolve the class imbalance. We compare the new method with four popular data augmentation algorithms on 12 imbalanced datasets. The experimental results demonstrate that (1) the samples augmented by the proposed algorithm are more diverse than those generated using compared resampling methods, such as SMOTE_ENN, and (2) the proposed algorithm not only achieves the lowest total misclassification cost but also outperforms other methods in terms of $F_2$ and G-mean metrics.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Cost-sensitive microbial data augmentation through matrix factorization

Article 30 September 2022

Ensemble microbial classification based on space partitioning and data augmentation

Article 29 November 2024

Microbial data augmentation combining feature extraction and transformer network

Article 17 December 2023

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Ai LY, Tian HY, Chen ZF, Chen HM, Xu J, Yuan FJ (2017) Systematic evaluation of supervised classifiers for fecal microbiota-based prediction of colorectal cancer. Oncotarget 8(6):9546–9556
Article Google Scholar
Asgari E, Garakani K, McHardy AC, Mofrad MR (2018) Micropheno: predicting environments and host phenotypes from 16s RRNA gene sequencing using a k-MER based representation of shallow sub-samples. Bioinformatics 34(13):i32–i42
Article Google Scholar
Barandela R, Valdovinos RM, Sánchez JS (2003) New applications of ensembles of classifiers. Pattern Anal Appl 6(3):245–256
Article MathSciNet Google Scholar
Batista GE, Bazzan AL, Monard MC et al (2003) Balancing training data for automated annotation of keywords: a case study. In: WOB, pp 10–18
Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newslett 6(1):20–29
Article Google Scholar
Cammarota G, Ianiro G, Cianci R, Bibbò S, Gasbarrini A, Currò D (2015) The involvement of gut microbiota in inflammatory bowel disease pathogenesis: potential for therapy. Pharmacol Ther 149:191–212
Article Google Scholar
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Article MATH Google Scholar
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) Smoteboost: improving prediction of the minority class in boosting. In: European conference on principles of data mining and knowledge discovery, Springer, pp 107–119
Chen HM, Yu Y, Wang JL, Lin YW, Kong X, Yang CQ, Yang L, Liu ZJ, Yuan YZ, Liu F, Wu JX, Zhong L, Fang DC, Zou WP, Fang JY (2013) Decreased dietary fiber intake and structural alteration of gut microbiota in patients with advanced colorectal adenoma. Am J Clin Nutr 97(5):1044–1052
Article Google Scholar
Chen T, Liu X, Feng R, Wang W, Yuan C, Lu W, He H, Gao H, Ying H, Chen DZ et al (2021) Discriminative cervical lesion detection in colposcopic images with global class activation and local bin excitation. IEEE J Biomed Health Inform 26(4):1411–1421
Article Google Scholar
Collado MC, Rautava S, Isolauri E, Salminen S (2015) Gut microbiota: a source of novel tools to reduce the risk of human disease? Pediatr Res 77(1):182–188
Article Google Scholar
Cox LM, Blaser MJ (2015) Antibiotics in early life and obesity. Nat Rev Endocrinol 11(3):182–190
Article Google Scholar
Dhar S, Cherkassky V (2014) Development and evaluation of cost-sensitive universum-SVM. IEEE Trans Cybern 45(4):806–818
Article Google Scholar
Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639):115–118
Article Google Scholar
Gao H, Xu K, Cao M, Xiao J, Xu Q, Yin Y (2021) The deep features and attention mechanism-based method to dish healthcare under social IOT systems: an empirical study with a hand-deep local-global net. IEEE Trans Comput Soc Syst 9(1):336–347
Article Google Scholar
Gao H, Xiao J, Yin Y, Liu T, Shi J (2022) A mutually supervised graph attention network for few-shot segmentation: the perspective of fully utilizing limited samples. IEEE Trans Neural Netw Learn Syst
Gohir W, Ratcliffe EM, Sloboda DM (2015) Of the bugs that shape us: maternal obesity, the gut microbiome, and long-term disease risk. Pediatr Res 77(1):196–204
Article Google Scholar
Guo SY, Rong Z, Wang S, Wu YH (2022) A lidar slam with PCA-based feature extraction and two-stage matching. IEEE Trans Instrum Meas 71:1–11
Google Scholar
Halfvarson J, Brislawn CJ, Lamendella R, Vázquez-Baeza Y, Walters WA, Bramer LM, D’Amato M, Bonfiglio F, McDonald D, Gonzalez A, McClure EE, Dunklebarger M, Knight R, Jansson JK (2017) Dynamics of the human gut microbiome in inflammatory bowel disease. Nat Microbiolo 2(5):17004–17004
Article Google Scholar
Han H, Wang WY, Mao BH (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing, Springer, pp 878–887
He H, Bai Y, Garcia EA, Li S (2008) Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), IEEE, pp 1322–1328
Kostic AD, Xavier RJ, Gevers D (2014) The microbiome in inflammatory bowel disease: current status and the future ahead. Gastroenterology 146(6):1489–1499
Article Google Scholar
Larsen PE, Dai Y (2015) Metabolome of human gut microbiome is predictive of host dysbiosis. GigaScience 4(1):s13742-015
Article Google Scholar
Last F, Douzas G, Bacao F (2017) Oversampling for imbalanced learning based on k-means and smote. arXiv preprint arXiv:1711.00837
Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X (2019) Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 166:4–21
Article Google Scholar
Li YX, Chai Y, Yin HP, Chen B (2021) A novel feature learning framework for high-dimensional data classification. Int J Mach Learn Cybern 12(2):555–569
Article Google Scholar
Liu Y, Kohlberger T, Norouzi M, Dahl GE, Smith JL, Mohtashamian A, Olson N, Peng LH, Hipp JD, Stumpe MC (2019) Artificial intelligence-based breast cancer nodal metastasis detection: insights into the black box for pathologists. Arch Pathol Lab Med 143(7):859–868
Article Google Scholar
Lo C, Marculescu R (2019) Metann: accurate classification of host phenotypes from metagenomic data using neural networks. BMC Bioinform 20(Suppl 12):1–14
Google Scholar
Luo S, Chen Z (2014) Sequential lasso cum EBIC for feature selection with ultra-high dimensional feature space. J Am Stat Assoc 109(507):1229–1240
Article MathSciNet MATH Google Scholar
Mahindru A, Sangal A (2021) Semidroid: a behavioral malware detector based on unsupervised machine learning techniques using feature selection approaches. Int J Mach Learn Cybern 12(5):1369–1411
Article Google Scholar
Mani I, Zhang I (2003) KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets, ICML 126, pp 1–7
Mountassir A, Benbrahim H, Berrada I (2012) An empirical study to address the problem of unbalanced data sets in sentiment classification. In: 2012 IEEE international conference on systems, man, and cybernetics (SMC), IEEE, pp 3298–3303
Pasolli E, Truong DT, Malik F, Waldron L, Segata N (2016) Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput Biol 12(7):e1004977
Article Google Scholar
Reiman D, Metwally AA, Dai Y (2017) Using convolutional neural networks to explore the microbiome. In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp 4269–4272
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524
Article MATH Google Scholar
Rosipal R, Girolami M, Trejo LJ, Cichocki A (2001) Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Comput Appl 10(3):231–243
Article MATH Google Scholar
Sahin Y, Bulkan S, Duman E (2013) A cost-sensitive decision tree approach for fraud detection. Expert Syst Appl 40(15):5916–5923
Article Google Scholar
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2009) Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A: Syst Hum 40(1):185–197
Article Google Scholar
Van’t Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, Van Der Kooy K, Marton MJ, Witteveen AT et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536
Article Google Scholar
Wang S, Yao X (2009) Diversity analysis on imbalanced data sets by using ensemble models. In: 2009 IEEE symposium on computational intelligence and data mining, IEEE, pp 324–331
Wen LY, Luo CG, Wu WZ, Min F (2020) Multi-label symbolic value partitioning through random walks. Neurocomputing 387:195–209
Article Google Scholar
Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 3:408–421
Article MathSciNet MATH Google Scholar
Wu HL, Cai LH, Li DF, Wang XY, Zhao SC, Zou FH, Zhou K (2018) Metagenomics biomarkers selected for prediction of three different diseases in Chinese population. Biomed Res Int 2018:1–7
Google Scholar
Wu J, Wang J, Liu L (2007) Feature extraction via KPCA for classification of gait patterns. Hum Mov Sci 26(3):393–411
Article Google Scholar
Yang LY, Xu ZS (2019) Feature extraction by PCA and diagnosis of breast tumors using SVM with de-based parameter tuning. Int J Mach Learn Cybern 10(3):591–601
Article Google Scholar
Ye MC, Ji CX, Chen H, Lei L, Lu HJ, Qian YT (2020) Residual deep PCA-based feature extraction for hyperspectral image classification. Neural Comput Appl 32(18):14287–14300
Article Google Scholar
Zhang X, Yang Y, Li T, Zhang Y, Wang H, Fujita H (2021) CMC: a consensus multi-view clustering model for predicting Alzheimer’s disease progression. Comput Methods Programs Biomed 199:105895
Article Google Scholar
Zhang Y, Zhang HP (2013) Microbiota associated with type 2 diabetes and its related complications. Food Sci Hum Wellness 2(3–4):167–172
Article Google Scholar
Zhang ZL, Luo XG, García S, Herrera F (2017) Cost-sensitive back-propagation neural networks with binarization techniques in addressing multi-class problems and non-competent classifiers. Appl Soft Comput 56:357–367
Article Google Scholar

Download references

Acknowledgements

This work was supported by the Central Government Funds of Guiding Local Scientific and Technological Development (No. 2021ZYD0003) and the Scientific Research Starting Project of SWPU (No. 2018QHR007).

Author information

Authors and Affiliations

School of Computer Science, Southwest Petroleum University, Chengdu, 610500, China
Liu-Ying Wen, Xiao-Min Zhang & Fan Min
Petroleum Engineering School, Southwest Petroleum University, Chengdu, 610500, China
Qing-Feng Li
Institute for Artificial Intelligence, Southwest Petroleum University, Chengdu, 610500, China
Fan Min

Authors

Liu-Ying Wen
View author publications
You can also search for this author inPubMed Google Scholar
Xiao-Min Zhang
View author publications
You can also search for this author inPubMed Google Scholar
Qing-Feng Li
View author publications
You can also search for this author inPubMed Google Scholar
Fan Min
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Fan Min.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wen, LY., Zhang, XM., Li, QF. et al. KGA: integrating KPCA and GAN for microbial data augmentation. Int. J. Mach. Learn. & Cyber. 14, 1427–1444 (2023). https://doi.org/10.1007/s13042-022-01707-3

Download citation

Received: 09 May 2022
Accepted: 27 October 2022
Published: 06 November 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s13042-022-01707-3

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

KGA: integrating KPCA and GAN for microbial data augmentation

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Cost-sensitive microbial data augmentation through matrix factorization

Ensemble microbial classification based on space partitioning and data augmentation

Microbial data augmentation combining feature extraction and transformer network

Explore related subjects

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now