Local dense mixed region cutting + global rebalancing: a method for imbalanced text sentiment classification

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Class imbalance is widespread in text sentiment classification data and poses a serious challenge for designing an effective classifier. In this paper, we propose a two-stage data balancing scheme for text sentiment classification that both clarifies the class boundary and balances the class distribution of the training set. The core algorithm of the first stage, LDMRC, is based on the shortest distance from a point to a straight line: it removes some majority class texts in the neighborhood of a minority class text, balancing the class distribution in local dense mixed regions. The second stage employs SS or RS as a data rebalancing strategy to globally balance the training set after local dense mixed region cutting. The scheme is applied as a preprocessing step in front of a learning algorithm such as SVM. We verify the effectiveness of the proposed method with SVM on eight imbalanced data sets: Book_c, Hotel, Jadeite and Insurance in Chinese, and DVD, Book_e, Electronics and Kitchen in English. The experimental results show that LDMRC is superior to the best existing cutting algorithm, BRC, in terms of Acc, RN and FN. Furthermore, LDMRC+SS and LDMRC+RS are superior to LDMRC alone on the Chinese data sets. This indicates that local boundary cutting alone does not achieve the best results, and that data rebalancing strategies are necessary for text sentiment classification.
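The two-stage idea can be illustrated with a minimal Python sketch. It is not the LDMRC algorithm itself, whose details the abstract does not give. Stage 1 here is a simplified stand-in that drops majority points lying within an assumed `radius` of any minority point, and stage 2 uses plain random undersampling as a generic rebalancing step; whether that matches the paper's SS or RS is not stated in the abstract. The helper `point_to_line_distance` only shows the geometric primitive the abstract names, and all function names, parameters and data points are hypothetical.

```python
import math
import random


def point_to_line_distance(p, a, b):
    """Shortest Euclidean distance from point p to the infinite line through a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_sq = sum(x * x for x in ab)
    if ab_sq == 0.0:
        # a and b coincide, so fall back to the point-to-point distance.
        return math.sqrt(sum(x * x for x in ap))
    # Projection coefficient of ap onto ab; the residual is perpendicular to the line.
    t = sum(x * y for x, y in zip(ap, ab)) / ab_sq
    residual = [x - t * y for x, y in zip(ap, ab)]
    return math.sqrt(sum(x * x for x in residual))


def cut_local_majority(minority, majority, radius):
    """Stage 1 (simplified stand-in, not LDMRC): keep only majority points
    farther than `radius` from every minority point."""
    return [m for m in majority
            if all(math.dist(m, s) > radius for s in minority)]


def random_undersample(majority, target_size, seed=0):
    """Stage 2 (assumed strategy): randomly keep `target_size` majority points."""
    if len(majority) <= target_size:
        return list(majority)
    return random.Random(seed).sample(majority, target_size)


# Toy 2-D feature vectors (hypothetical data).
minority = [(0.0, 0.0), (1.0, 1.0)]
majority = [(0.1, 0.0), (1.0, 1.2), (3.0, 3.0),
            (4.0, 0.0), (0.0, 4.0), (5.0, 5.0)]

kept = cut_local_majority(minority, majority, radius=0.5)
balanced = random_undersample(kept, target_size=len(minority))
print(len(majority), len(kept), len(balanced))
```

In this toy run, the two majority points sitting next to minority points are removed locally, and the remaining four are then sampled down to match the minority class size. The balanced set would be passed to a classifier such as an SVM.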

Acknowledgements

The authors would like to thank all anonymous reviewers. The works described in this paper are supported by the National Natural Science Foundation of China (NSFC nos. 61632011, 61573231, 61432011, 61672331, U1435212).

Author information

Corresponding author

Correspondence to Suge Wang.

About this article

Cite this article

Li, Y., Wang, J., Wang, S. et al. Local dense mixed region cutting + global rebalancing: a method for imbalanced text sentiment classification. Int. J. Mach. Learn. & Cyber. 10, 1805–1820 (2019). https://doi.org/10.1007/s13042-018-0858-x

  • DOI: https://doi.org/10.1007/s13042-018-0858-x
