Language-Independent Twitter Classification Using Character-Based Convolutional Networks

Zhang, Shiwei; Zhang, Xiuzhen; Chan, Jeffrey

doi:10.1007/978-3-319-69179-4_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10604)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Most research on Twitter classification focuses on tweets in English, yet Twitter supports over 40 languages and about 50% of tweets are not in English. To make full use of Twitter content, it is important to develop classifiers that can handle multilingual tweets and mixed-language tweets (for example, tweets mainly in Chinese that contain English words). Translation-based models are the classical approach to multilingual or cross-lingual text classification. Character-based neural models have recently been shown to be effective for text classification, but they are designed for a limited set of European languages and require language identification in order to build the alphabet used to encode and quantize characters. In this paper, we propose UniCNN (Unicode character Convolutional Networks), a fully language-independent character-based CNN model that classifies tweets in multiple languages and in mixed languages without requiring language identification. Specifically, we encode the sequence of characters in a tweet as a sequence of numerical UTF-8 codes and then train a character-based CNN classifier on that sequence. A character-based embedding layer placed before the convolutional layer learns distributed character representations. We conducted experiments on Twitter datasets for multilingual sentiment classification in six languages and for mixed-language informativeness classification in over 40 languages. In these experiments, UniCNN mostly outperformed state-of-the-art neural models and traditional feature-based models, without the extra burden of translation or tokenization.
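
The encoding step described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading, not the authors' implementation: it assumes that the "numerical UTF-8 code" of a character is its UTF-8 byte sequence read as a big-endian integer, and that tweets are padded or truncated to a fixed length. The names `utf8_code`, `encode_tweet`, `PAD`, and the default `max_len` are hypothetical and do not come from the paper.

```python
from typing import List

PAD = 0  # hypothetical padding code; the abstract does not specify one


def utf8_code(ch: str) -> int:
    """Return the UTF-8 bytes of one character read as a big-endian integer.

    Assumption: this is one way to obtain a "numerical UTF-8 code".
    """
    return int.from_bytes(ch.encode("utf-8"), "big")


def encode_tweet(text: str, max_len: int = 140) -> List[int]:
    """Map a tweet to a fixed-length list of per-character codes.

    No language identification or tokenization is involved: every
    character, whatever its script, goes through the same function.
    """
    codes = [utf8_code(ch) for ch in text[:max_len]]
    return codes + [PAD] * (max_len - len(codes))


if __name__ == "__main__":
    # 'a' is the single byte 0x61; 'é' is the two bytes 0xC3 0xA9.
    print(encode_tweet("aé", max_len=4))  # [97, 50089, 0, 0]
```

Integer sequences of this kind could then be passed to an embedding layer ahead of the convolutional layers, as the abstract describes. In practice the raw codes would probably need to be remapped to a compact index range first, which this sketch leaves out.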

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, Melbourne, 3001, Australia
Shiwei Zhang, Xiuzhen Zhang & Jeffrey Chan

Corresponding author

Correspondence to Shiwei Zhang.

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore, Singapore
Gao Cong
National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Macquarie University, Sydney, New South Wales, Australia
Wei Emma Zhang
Wuhan University, Wuhan, China
Chengliang Li
Nanyang Technological University, Singapore, Singapore
Aixin Sun

About this paper

Cite this paper

Zhang, S., Zhang, X., Chan, J. (2017). Language-Independent Twitter Classification Using Character-Based Convolutional Networks. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_29

DOI: https://doi.org/10.1007/978-3-319-69179-4_29
Published: 14 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69178-7
Online ISBN: 978-3-319-69179-4
eBook Packages: Computer Science, Computer Science (R0)
