Abstract
Class imbalance is widespread in text sentiment classification and poses a serious challenge to designing an effective classifier. In this paper, we propose a two-stage data balancing scheme for text sentiment classification that both sharpens the class boundary and balances the class distribution of the training set. The first stage uses the scheme's core algorithm, LDMRC (local dense mixed region cutting), which is based on the shortest distance from a point to a straight line: it removes some majority-class texts in the neighborhood of a minority-class text in order to balance the class distribution within the local dense mixed region. The second stage employs SS or RS as a data rebalancing strategy to globally balance the training set after local dense mixed region cutting. The scheme is applied as a preprocessing step before a learning algorithm such as SVM. We verify its effectiveness using SVM on eight imbalanced data sets: Book_c, Hotel, Jadeite and Insurance in Chinese, and DVD, Book_e, Electronics and Kitchen in English. The experimental results show that LDMRC outperforms BRC, the best existing cutting algorithm, in terms of Acc, RN and FN. Furthermore, LDMRC+SS and LDMRC+RS outperform LDMRC alone on the Chinese data sets. This indicates that local boundary cutting alone does not achieve the best results, and that data rebalancing strategies are necessary for text sentiment classification.
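The two-stage idea (local cutting followed by global rebalancing) can be illustrated with a minimal Python sketch. This is not the LDMRC algorithm itself: the abstract does not give enough detail on the point-to-line criterion or on SS and RS, so the sketch assumes a plain Euclidean radius for the local cut and random under-sampling of the majority class for the global step. The function names `cut_local_majority` and `random_undersample`, the `radius` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def cut_local_majority(X, y, minority, radius):
    """Stage 1 (simplified assumption, not LDMRC): drop every majority
    sample that lies within `radius` of at least one minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority
    if not is_min.any():
        return X, y
    # Pairwise Euclidean distances, shape (n_samples, n_minority).
    d = np.linalg.norm(X[:, None, :] - X[is_min][None, :, :], axis=2)
    near_minority = (d <= radius).any(axis=1)
    keep = is_min | ~near_minority  # minority samples are always kept
    return X[keep], y[keep]

def random_undersample(X, y, minority, seed=0):
    """Stage 2 (assumed stand-in for the global rebalancing step):
    randomly drop majority samples until both classes have equal size."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    n_keep = min(len(min_idx), len(maj_idx))
    kept_maj = rng.choice(maj_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([min_idx, kept_maj]))
    return X[keep], y[keep]

# Toy data: label 1 is the minority class; two majority points sit
# inside the minority neighborhood and five lie far away.
X = [[0.0, 0.0], [0.0, 0.2], [0.1, 0.0], [0.2, 0.1],
     [3.0, 3.0], [3.0, 4.0], [4.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0]

X1, y1 = cut_local_majority(X, y, minority=1, radius=0.5)
X2, y2 = random_undersample(X1, y1, minority=1)
print(len(y1), len(y2))
```

On this toy set the first stage removes the two majority points mixed in with the minority samples, and the second stage trims the remaining majority samples to match the minority count. The balanced set could then be passed to a classifier such as an SVM.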
Acknowledgements
The authors would like to thank all anonymous reviewers. The work described in this paper was supported by the National Natural Science Foundation of China (NSFC nos. 61632011, 61573231, 61432011, 61672331, U1435212).
Cite this article
Li, Y., Wang, J., Wang, S. et al. Local dense mixed region cutting + global rebalancing: a method for imbalanced text sentiment classification. Int. J. Mach. Learn. & Cyber. 10, 1805–1820 (2019). https://doi.org/10.1007/s13042-018-0858-x
DOI: https://doi.org/10.1007/s13042-018-0858-x