Abstract
Imbalanced training data poses a serious problem for supervised learning based text classification. The problem becomes more serious in emotion classification tasks with multiple emotion categories, as the training data can be quite skewed. This paper presents a novel over-sampling method that forms additional sum sentence vectors for minority classes in order to improve emotion classification on imbalanced data. First, a large corpus is used to train a continuous skip-gram model, with the word/POS pair as the unit of each word vector. The sentence vectors of the training data are then constructed as the sum of their word/POS vectors. New minority-class training samples are then generated by randomly adding two sentence vectors from the corresponding class until every class has the same number of training samples, so that the classifiers can be trained on a fully balanced training dataset. Evaluation on the NLP&CC2013 Chinese micro blog emotion classification dataset shows that the obtained classifier achieves 48.4% average precision, an improvement of 11.9 percentage points over the state-of-the-art performance on this dataset (36.5%). This result shows that the proposed over-sampling method can effectively address the problem of data imbalance and thus achieve much improved performance for emotion classification.
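The over-sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `sentence_vector` and `oversample_by_vector_sum`, the dictionary-based embedding lookup, and the toy vectors are all hypothetical. Two further details are assumptions, since the abstract does not specify them: the two summed vectors are drawn as distinct sentences, and synthetic samples are built only from original (not previously generated) vectors.

```python
import numpy as np


def sentence_vector(tokens, embeddings, dim):
    """Sum the word/POS vectors of one sentence (unknown tokens are skipped)."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    return vec


def oversample_by_vector_sum(X, y, seed=0):
    """Balance classes by adding pairs of same-class sentence vectors.

    Assumption: each synthetic sample is the sum of two *distinct*
    original vectors from the same minority class.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()  # every class is grown to the majority size

    new_X, new_y = [X], [y]
    for label, count in zip(labels, counts):
        need = int(target - count)
        if need == 0:
            continue
        members = X[y == label]
        if len(members) < 2:
            raise ValueError(f"class {label!r} needs at least two samples")
        # Draw `need` random pairs of distinct row indices within the class.
        pairs = np.array(
            [rng.choice(len(members), size=2, replace=False) for _ in range(need)]
        )
        synthetic = members[pairs[:, 0]] + members[pairs[:, 1]]
        new_X.append(synthetic)
        new_y.append(np.full(need, label))
    return np.vstack(new_X), np.concatenate(new_y)


# Toy usage with made-up 2-dimensional word/POS vectors.
toy_embeddings = {
    "happy/a": np.array([1.0, 0.0]),
    "day/n": np.array([0.0, 1.0]),
    "sad/a": np.array([-1.0, 0.5]),
}
sentences = [["happy/a", "day/n"], ["happy/a"], ["day/n"], ["sad/a"], ["sad/a", "day/n"]]
labels = ["joy", "joy", "joy", "sadness", "sadness"]
X_train = np.array([sentence_vector(s, toy_embeddings, 2) for s in sentences])
X_bal, y_bal = oversample_by_vector_sum(X_train, labels)
print(dict(zip(*np.unique(y_bal, return_counts=True))))
```

In a real pipeline the embedding dictionary would come from the skip-gram model trained on the large corpus, and the balanced arrays would be passed to the classifier in place of the original training set.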
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, T. et al. (2014). A Sentence Vector Based Over-Sampling Method for Imbalanced Emotion Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_6
DOI: https://doi.org/10.1007/978-3-642-54903-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)