Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms

Li, Jinyan; Fong, Simon; Mohammed, Sabah; Fiaidhi, Jinan

doi:10.1007/s11227-015-1541-6

Published: 16 November 2015

Volume 72, pages 3708–3728, (2016)

The Journal of Supercomputing

Jinyan Li¹, Simon Fong¹, Sabah Mohammed² & Jinan Fiaidhi²

Abstract

Classification, a popular supervised machine learning method, has many applications in computational biology, where data samples are automatically categorized into predefined labels with the aid of data mining. Often the training samples contain very few instances of interest (e.g., medical anomalies, rare diseases in a population, and unusual syndromes) but many normal instances. Such an imbalanced ratio of data distributions among the target labels hampers the efficacy of classification algorithms, because the induced model has not been trained with a sufficient number of instances of the interesting label(s) and is instead overwhelmed by ordinary training records. Traditional remedies attempt to rebalance the data distributions of the target classes by artificially inflating the interesting instances, reducing the majority of common instances, or combining both. Though the fundamental concept is effective, there is no clear guideline on how to strike a balance between fabricating rare samples and reducing normal ones so as to maximize classification accuracy. In this paper, an optimization model using different swarm strategies (the Bat-inspired algorithm and PSO) is proposed for adaptively balancing the increase/decrease of the class distribution, depending on the properties of the biological datasets. The optimization is further extended to achieve the highest possible accuracy and Kappa statistic at the same time. The optimization model is tested on five imbalanced medical datasets, which are sourced from lung surgery logs and virtual screening of bioassay data. Computer simulation results show that the proposed optimization model outperforms other class balancing methods in medical data classification.
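The abstract names accuracy and the Kappa statistic as joint objectives. The following minimal sketch, which is not the authors' code, illustrates why accuracy alone can mislead on imbalanced data. It assumes a binary problem summarized by a confusion matrix; the function names and the example counts are hypothetical and chosen only for illustration.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa: agreement between prediction and truth, corrected for chance.

    tp/fn/fp/tn are the four cells of a binary confusion matrix, with the
    rare (interesting) class treated as the positive class.
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement: product of predicted and actual marginals, per class.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        # Degenerate case: only one class appears in both truth and prediction.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical dataset: 5 rare cases among 100 samples.
# A model that always predicts "normal" never finds a rare case.
always_normal = dict(tp=0, fn=5, fp=0, tn=95)
# A model that finds 4 of the 5 rare cases at the cost of 5 false alarms.
finds_rare = dict(tp=4, fn=1, fp=5, tn=90)

for name, cm in [("always normal", always_normal), ("finds rare", finds_rare)]:
    print(name, "accuracy:", accuracy(**cm), "kappa:", cohen_kappa(**cm))
```

In this illustration the always-normal model reaches an accuracy of 0.95 while its kappa is 0, because its agreement with the true labels is no better than chance. A search over resampling rates that scores candidates on both measures would therefore not reward such a model.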

Acknowledgments

The authors are thankful for the financial support from the research grant “Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF)”, Grant No. MYRG2015-00128-FST, offered by the University of Macau, FST, and RDAO.

Author information

Authors and Affiliations

Department of Computer and Information Science, University of Macau, Taipa, Macau SAR
Jinyan Li & Simon Fong
Department of Computer Science, Lakehead University, Thunder Bay, Canada
Sabah Mohammed & Jinan Fiaidhi

Corresponding author

Correspondence to Simon Fong.

Appendix

See Tables 3, 4, 5, 6 and 7.

Table 3 Results of surgery dataset
Table 4 Results of AID 362 in Bioassay
Table 5 Results of AID 439 in Bioassay
Table 6 Results of AID 721 in Bioassay
Table 7 Results of AID 1284 in Bioassay

About this article

Cite this article

Li, J., Fong, S., Mohammed, S. et al. Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J Supercomput 72, 3708–3728 (2016). https://doi.org/10.1007/s11227-015-1541-6

Published: 16 November 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s11227-015-1541-6
