Abstract
Traditional text classification methods assume that the dataset is balanced. In practice, however, many datasets are imbalanced, and traditional classification methods do not achieve satisfactory results on them. Based on a comprehensive analysis of existing research on imbalanced data classification, we propose a data rebalancing method based on weighted sampling. The method assigns a weight to each class by calculating the ratios between the different categories. Each class is then sampled at a different ratio using weighted sampling. Experimental results on a real Chinese text dataset show that the proposed method can effectively improve classification accuracy.
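The rebalancing idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, so the sketch makes an assumption: each class's sampling ratio is the mean class size divided by that class's size, which undersamples large classes and oversamples small ones. The function names `sampling_ratios` and `weighted_resample` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def sampling_ratios(labels):
    """Assumed weighting: ratio = mean class size / size of this class."""
    counts = Counter(labels)
    mean_size = sum(counts.values()) / len(counts)
    return {label: mean_size / n for label, n in counts.items()}


def weighted_resample(samples, labels, seed=0):
    """Resample each class at its own ratio (an illustration, not the paper's exact method)."""
    rng = random.Random(seed)
    ratios = sampling_ratios(labels)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    new_samples, new_labels = [], []
    for label, items in by_class.items():
        k = max(1, round(ratios[label] * len(items)))
        if k <= len(items):
            # Ratio <= 1: undersample without replacement.
            chosen = rng.sample(items, k)
        else:
            # Ratio > 1: keep every original and add random duplicates.
            chosen = items + rng.choices(items, k=k - len(items))
        new_samples.extend(chosen)
        new_labels.extend([label] * len(chosen))
    return new_samples, new_labels


# Toy data: 90 documents of class "a" and 10 of class "b".
docs = [f"doc{i}" for i in range(100)]
tags = ["a"] * 90 + ["b"] * 10
balanced_docs, balanced_tags = weighted_resample(docs, tags)
```

Under this assumed weighting, both toy classes end up at the mean class size of 50. A different weighting, such as one based on the ratio to the largest class, could be swapped into `sampling_ratios` without changing the resampling step.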
Notes
- 1. The corpus is available at http://www.nlpir.org/download.
Acknowledgements
The authors’ work was sponsored by the National Program on Key Basic Research Project (973 Program) of China (2013CB329601, 2013CB329602), the National High Technology Research and Development Program (863 Program) of China (2011AA010702, 2012AA01A401, and 2012AA01A402), the Natural Science Foundation of China (60933005, 91124002), the Support Science and Technology Project of China (2012BAH38B04, 2012BAH38B06), and the China Postdoctoral Science Foundation Program (2012M520114).
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, H., Zou, P., Han, W., Xia, R. (2014). Imbalanced Chinese Text Classification Based on Weighted Sampling. In: Yuan, Y., Wu, X., Lu, Y. (eds) Trustworthy Computing and Services. ISCTCS 2013. Communications in Computer and Information Science, vol 426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43908-1_5
DOI: https://doi.org/10.1007/978-3-662-43908-1_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43907-4
Online ISBN: 978-3-662-43908-1
eBook Packages: Computer Science, Computer Science (R0)