Abstract
Most research on Twitter classification focuses on tweets in English. However, Twitter supports over 40 languages, and about 50% of tweets are not in English. To make full use of Twitter content, it is important to develop classifiers that can handle multilingual tweets and tweets of mixed languages (for example, tweets mainly in Chinese but containing English words). Translation-based models are a classical approach to multilingual or cross-lingual text classification. More recently, character-based neural models have been shown to be effective for text classification, but they are designed for a limited set of European languages and require language identification in order to build the alphabet used to encode and quantize characters. In this paper, we propose UniCNN (Unicode character Convolutional Networks), a fully language-independent character-based CNN model for classifying tweets in multiple languages and in mixed languages without language identification. Specifically, we propose to encode the sequence of characters in a tweet as a sequence of numerical UTF-8 codes and then train a character-based CNN classifier on that sequence. In addition, a character-based embedding layer placed before the convolutional layer learns a distributed character representation. We conducted experiments on Twitter datasets for multilingual sentiment classification in six languages and for mixed-language informativeness classification in over 40 languages. In these experiments, UniCNN mostly performed better than state-of-the-art neural models and traditional feature-based models, without the extra burden of translation or tokenization.
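The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "numerical UTF-8 codes" means the raw UTF-8 byte values of the tweet, and the names `encode_tweet`, `PAD_ID`, `VOCAB_SIZE` and `MAX_LEN`, the fixed length, and the shift-by-one padding scheme are all hypothetical choices made for illustration.

```python
from typing import List

# All constants below are illustrative assumptions, not values from the paper.
PAD_ID = 0        # id reserved for padding
VOCAB_SIZE = 257  # 256 possible byte values shifted by one, plus PAD_ID
MAX_LEN = 280     # hypothetical fixed input length, counted in bytes


def encode_tweet(text: str, max_len: int = MAX_LEN) -> List[int]:
    """Map a tweet to a fixed-length list of integer ids.

    Each UTF-8 byte b becomes the id b + 1, so that 0 stays free for padding.
    No language identification or tokenization is involved.
    """
    byte_ids = [b + 1 for b in text.encode("utf-8")]
    # Truncation works on bytes, so it may cut a multi-byte character in half.
    byte_ids = byte_ids[:max_len]
    return byte_ids + [PAD_ID] * (max_len - len(byte_ids))


# A mixed-language tweet: three Chinese characters, a space, an English word.
mixed = "我喜欢 Python"
ids = encode_tweet(mixed, max_len=20)
```

Under these assumptions, every tweet maps into the same integer vocabulary of size `VOCAB_SIZE` regardless of its language, and the resulting sequence is what an embedding layer followed by convolutional layers would take as input.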
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, S., Zhang, X., Chan, J. (2017). Language-Independent Twitter Classification Using Character-Based Convolutional Networks. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_29
DOI: https://doi.org/10.1007/978-3-319-69179-4_29
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69178-7
Online ISBN: 978-3-319-69179-4
eBook Packages: Computer Science (R0)