Orthographic features for emotion classification in Chinese in informal short texts

  • Original Paper
Language Resources and Evaluation

Abstract

Informal short texts on the web are rich in emotions, as they often reflect unfiltered, immediate reactions to breaking news events. This emotional density, however, stands in contrast to the poverty of linguistic context and features available for emotion classification. This paper tackles that challenge by proposing orthographic features based on orthographic code-mixing and code-switching for both non-ML and ML approaches. Our results show that, as expected, orthographic features routinely outperform grammatical features for emotion classification of short texts in all approaches. Orthographic features were also shown to make more significant contributions, especially in terms of precision and in formal texts, when state-of-the-art deep learning algorithms are applied. This result confirms the effectiveness of the orthographic change feature for the task of emotion classification. These results are argued to be applicable to all languages because of the common code-shifting in languages with non-Latin orthographies and the use of non-letter symbols in all languages.

Notes

  1. Source website: https://www.bilibili.com/. The datasets are included at: https://github.com/yunfeilongpoly/ource_for_Orthographic_features_for_Emotion_Classification_in_Chinese.

  2. First, we segment each text file into words (for English, by splitting on spaces) and count how many times each word occurs in each document. We then assign each word an integer id. Each unique word in our dictionary corresponds to a feature (a descriptive feature), and we use tf-idf to reduce the weight of high-frequency words such as the, is, and an, which occur in all documents.

  3. https://www.ltp-cloud.com/demo/.


Author information

Corresponding author

Correspondence to Chu-Ren Huang.

About this article

Cite this article

Chen, IH., Long, Y., Lu, Q. et al. Orthographic features for emotion classification in Chinese in informal short texts. Lang Resources & Evaluation 55, 329–352 (2021). https://doi.org/10.1007/s10579-020-09515-3

  • DOI: https://doi.org/10.1007/s10579-020-09515-3
