
Language-Independent Twitter Classification Using Character-Based Convolutional Networks

  • Conference paper
  • First Online:
Advanced Data Mining and Applications (ADMA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10604)

Abstract

Most research on Twitter classification has focused on tweets in English. However, Twitter supports over 40 languages, and about 50% of tweets are non-English. To make full use of Twitter content, it is important to develop classifiers that can handle multilingual tweets or tweets in mixed languages (for example, tweets mainly in Chinese but containing English words). The translation-based model is a classical approach to multilingual or cross-lingual text classification. Recently, character-based neural models have been shown to be effective for text classification, but they are designed for a limited set of European languages and require language identification to build an alphabet for encoding and quantizing characters. In this paper, we propose UniCNN (Unicode character Convolutional Networks), a fully language-independent character-based CNN model for classifying tweets in multiple languages and mixed languages, without requiring language identification. Specifically, we propose to encode the sequence of characters in a tweet as a sequence of numerical UTF-8 codes and then train a character-based CNN classifier. In addition, a character embedding layer is placed before the convolutional layer to learn distributed character representations. We conducted experiments on Twitter datasets for multilingual sentiment classification in six languages and for mixed-language informativeness classification in over 40 languages. Our experiments showed that UniCNN mostly outperformed state-of-the-art neural models and traditional feature-based models, while avoiding the extra burden of translation or tokenization.
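
To make the encoding-and-convolution idea concrete, the following is a minimal sketch of a UniCNN-style classifier in Keras. It is an illustration under assumptions, not the authors' exact architecture: characters are mapped to their numeric Unicode code points (the paper describes numerical UTF-8 codes; a byte-level encoding would be a close variant), a single convolutional block is used, and the vocabulary size, embedding dimension, filter width, and sequence length are all illustrative.

```python
# Sketch of a UniCNN-style character-level CNN for tweet classification.
# Hyperparameters and the exact character encoding are assumptions, not the
# settings reported in the paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 140        # tweets were limited to 140 characters at the time
VOCAB_SIZE = 65536   # cover the Basic Multilingual Plane (assumption)

def encode_tweet(text, max_len=MAX_LEN):
    """Map each character to its Unicode code point and pad/truncate to max_len."""
    codes = [min(ord(ch), VOCAB_SIZE - 1) for ch in text[:max_len]]
    return codes + [0] * (max_len - len(codes))

def build_unicnn(num_classes=2):
    inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
    # Character embedding layer: learns a distributed representation per code point.
    x = layers.Embedding(VOCAB_SIZE, 128)(inputs)
    # Convolution over the character sequence, then max-over-time pooling.
    x = layers.Conv1D(256, kernel_size=5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage: tweets in any language or mix of languages are encoded the same way,
# with no tokenization or language identification.
X = np.array([encode_tweet("Great game tonight!"),
              encode_tweet("今晚的比赛很精彩 so exciting")])
y = np.array([1, 1])
model = build_unicnn()
model.fit(X, y, epochs=1, verbose=0)
```

Note that no translation, tokenization, or language-detection step appears anywhere in this pipeline; the only preprocessing is the character-to-code mapping, which is what makes the approach language-independent.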


Notes

  1. http://latimesblogs.latimes.com/technology/2010/02/twitter-tweets-english.html.

  2. https://dev.twitter.com/basics/counting-characters.

  3. http://www.utf-8.com.


Author information

Corresponding author

Correspondence to Shiwei Zhang.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Zhang, S., Zhang, X., Chan, J. (2017). Language-Independent Twitter Classification Using Character-Based Convolutional Networks. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science (LNAI), vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_29


  • DOI: https://doi.org/10.1007/978-3-319-69179-4_29

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69178-7

  • Online ISBN: 978-3-319-69179-4

  • eBook Packages: Computer Science, Computer Science (R0)
