
Data augmentation for cancer classification in oncogenomics: an improved KNN based approach

  • Special Issue
  • Published:
Evolutionary Intelligence

Abstract

There is currently a great need for research on gene expression data to support cancer classification in the field of oncogenomics, especially since the disease occurs sporadically and often does not show symptoms. Gene expression data are typically disproportionate, with a large number of features and a small number of samples. A small sample size is likely to adversely affect classification accuracy, as the performance of a classifier depends largely on the data. There is therefore a pressing need to generate data that can serve as better input to classifiers. Primitive augmentation techniques such as uniform random generation and addition of noise do not assure a good probability distribution. Moreover, since we deal with critical applications, the augmented data must have a high likelihood of resembling the original values. We therefore propose an improved variant of the k-nearest neighbor (KNN) rule. We use a Counting Quotient Filter, Euclidean distance and the mean of the best values from the k neighbors of each target sample to generate synthetic samples. A comparison is drawn among the raw data from the public domain (original data), data generated using the standard k-nearest neighbor rule and data generated using the improved k-nearest neighbor rule. The data generated through these approaches are then classified using state-of-the-art classifiers such as SVM, J48 and DNN. The samples generated through our improved technique yield better recall values than the standard implementation, ensuring the sensitivity of the data. The average classification accuracy across the three classifiers shows an improvement of 7.72% over the traditional KNN approach and of 16% over the raw data given as input to the classifiers. The proposed algorithm thus attains two objectives: ensuring data sensitivity for critical applications and enhancing classification accuracy.
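The KNN-based synthesis described above can be illustrated with a minimal sketch. The Python fragment below is only an assumption of how such augmentation might look: for each target sample it finds the k nearest same-class neighbors by Euclidean distance and averages them into one synthetic sample. The Counting Quotient Filter used in the proposed method is omitted here, and the helper name knn_mean_augment is hypothetical.

import numpy as np

def knn_mean_augment(X, y, k=5):
    # Hypothetical helper: KNN-based augmentation sketch.
    # For every target sample, average its k nearest same-class
    # neighbours (Euclidean distance) into one synthetic sample.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_syn, y_syn = [], []
    for i in range(len(X)):
        # candidate neighbours: all other samples with the same label
        same = np.where(y == y[i])[0]
        same = same[same != i]
        if len(same) < k:
            continue  # not enough neighbours to synthesise from
        # Euclidean distances from the target sample to the candidates
        dist = np.linalg.norm(X[same] - X[i], axis=1)
        nearest = same[np.argsort(dist)[:k]]
        # synthetic sample = mean of the k nearest neighbours
        X_syn.append(X[nearest].mean(axis=0))
        y_syn.append(y[i])
    if not X_syn:
        return X, y
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])

The augmented feature matrix and labels produced by such a routine could then be passed to the SVM, J48 and DNN classifiers for the accuracy and recall comparison reported in the abstract.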





Funding

The authors received no funding for this work.

Author information


Corresponding author

Correspondence to Poonam Chaudhari.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chaudhari, P., Agarwal, H. & Bhateja, V. Data augmentation for cancer classification in oncogenomics: an improved KNN based approach. Evol. Intel. 14, 489–498 (2021). https://doi.org/10.1007/s12065-019-00283-w


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s12065-019-00283-w
