Research on classification method of high-dimensional class-imbalanced datasets based on SVM

Zhang, Chunkai; Zhou, Ying; Guo, Jianwei; Wang, Guoquan; Wang, Xuan

doi:10.1007/s13042-018-0853-2

Original Article
Published: 09 July 2018

Volume 10, pages 1765–1778, (2019)

International Journal of Machine Learning and Cybernetics

Chunkai Zhang¹, Ying Zhou¹, Jianwei Guo¹, Guoquan Wang¹ & Xuan Wang¹

1120 Accesses
35 Citations

Abstract

High-dimensional data degrade classification because some combinations of features have an adverse effect on the classifier, while class imbalance biases the classifier toward the majority class at the expense of the minority class, because majority-class samples outnumber minority-class samples. Data that are both high-dimensional and class-imbalanced arise in many fields, such as bioinformatics and healthcare. Many researchers have studied either the high-dimensional problem or the class-imbalanced problem and have proposed a series of algorithms, but they overlook the new problem that appears when the two occur together: high dimensionality affects the sampling process, while class imbalance interferes with feature selection. This paper first analyses the new problem arising from the mutual influence of the two problems, then introduces SVM and analyses its advantages in dealing with high-dimensional and class-imbalanced problems. Next, it proposes a new algorithm, BRFE-PBKS-SVM, aimed at high-dimensional class-imbalanced datasets. The algorithm improves SVM-RFE by taking class imbalance into account during feature selection, and it improves SMOTE so that over-sampling can work in the Hilbert space with an over-sampling rate chosen adaptively by PSO. Finally, experimental results show the performance of the algorithm.
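
To make the over-sampling idea concrete, the following is a minimal sketch of classic SMOTE-style interpolation, with minority neighbours ranked by the distance induced by a kernel, d²(x, y) = K(x, x) − 2K(x, y) + K(y, y). It is not the PBKS procedure from the paper, whose details the abstract does not give. The function names, the RBF kernel, the `gamma` value, and the fixed number of synthetic samples are all assumptions made for illustration. In particular, interpolation here stays in input space, and the PSO-tuned over-sampling rate is replaced by a plain `n_new` argument.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel. gamma is an illustrative value, not from the paper."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance_sq(x, y, kernel=rbf_kernel):
    """Squared distance between the images of x and y in the kernel-induced space."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def smote_sketch(minority, n_new, k=2, rng=None):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours, where
    "nearest" is measured with kernel_distance_sq.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by kernel-induced distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: kernel_distance_sq(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_sketch(minority, n_new=3):
        print(point)
```

In a full pipeline, the number of synthetic samples would be a tunable quantity; the abstract states that the paper selects the over-sampling rate adaptively with PSO, which this sketch does not attempt.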

Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2016YFB0800900).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen, China
Chunkai Zhang, Ying Zhou, Jianwei Guo, Guoquan Wang & Xuan Wang

Corresponding author

Correspondence to Chunkai Zhang.

About this article

Cite this article

Zhang, C., Zhou, Y., Guo, J. et al. Research on classification method of high-dimensional class-imbalanced datasets based on SVM. Int. J. Mach. Learn. & Cyber. 10, 1765–1778 (2019). https://doi.org/10.1007/s13042-018-0853-2

Received: 09 November 2017
Accepted: 06 June 2018
Published: 09 July 2018
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s13042-018-0853-2
