Vietnamese POS Tagging for Social Media Text

  • Conference paper
Neural Information Processing (ICONIP 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9949)

Abstract

This paper presents an empirical study of Vietnamese part-of-speech (POS) tagging for social media text, which poses several challenges compared with tagging general text. Social media text does not always conform to formal grammar or correct spelling, and it frequently uses abbreviations, foreign words, and icons. A POS tagger developed for conventional, edited text would perform poorly on such noisy data. We address this problem by proposing a tagging model based on conditional random fields (CRFs) with various kinds of features for Vietnamese social media text. We introduce a corpus for POS tagging that consists of more than four thousand sentences from Facebook, the most popular social network in Vietnam. Using this corpus, we performed a series of experiments to evaluate the proposed model. Our model achieved 88.26% tagging accuracy, an 11.27% improvement over a state-of-the-art Vietnamese POS tagger developed for general, conventional text.
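To make the idea of feature-based tagging for noisy text more concrete, the following is a minimal Python sketch of per-token feature extraction and token-level accuracy. It is not the authors' feature set: the feature names, the small emoticon set, the regular expressions, and the example tags are all assumptions chosen for illustration. The output is one feature dictionary per token, a shape that some CRF toolkits accept as input.

```python
import re

# Hypothetical, tiny emoticon lexicon for illustration only; the list the
# paper's notes refer to is not reproduced here.
EMOTICONS = {":)", ":(", ":D", ":P", ";)", "<3", ":))", "=))"}

URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
REPEAT_RE = re.compile(r"(.)\1{2,}")  # same character three or more times in a row


def token_features(tokens, i):
    """Return a feature dict for tokens[i]. All feature names are assumptions."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_emoticon": tok in EMOTICONS,
        "is_url": bool(URL_RE.match(tok)),
        "is_hashtag": tok.startswith("#") and len(tok) > 1,
        "has_digit": any(ch.isdigit() for ch in tok),
        "is_upper": tok.isupper(),
        "has_repeat": bool(REPEAT_RE.search(tok)),  # e.g. elongated words
        "suffix2": tok[-2:].lower(),
    }
    # Context features: neighbouring tokens, with boundary markers.
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def tagging_accuracy(gold, predicted):
    """Token-level accuracy: fraction of positions where the tags agree."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must have equal length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    sentence = ["hôm", "nay", "vuiiii", "quá", ":))"]
    for idx in range(len(sentence)):
        print(token_features(sentence, idx))
    # Placeholder tags, not taken from the paper's corpus.
    print(tagging_accuracy(["N", "V", "A", "R"], ["N", "V", "N", "R"]))
```

Surface cues such as elongated spellings, emoticons, and hashtags are the kind of signal a tagger trained only on edited text never sees, which is why a sketch like this treats them as explicit features.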

Notes

  1. http://commons.apache.org/codec/.

  2. http://mim.hus.vnu.edu.vn/phuonglh/softwares/vnTagger.

  3. We collected the list from two links: http://kenh76.vn/ki-tu-ky-hieu-bieu-tuong-tren-facebook-chat-cap-nhat-2014.html and https://en.wikipedia.org/wiki/List_of_emoticons.

Acknowledgements

This work was partially supported by the “2016 PTIT Research Grant”, Posts and Telecommunications Institute of Technology, Vietnam. We would also like to thank FPT for the financial support that made this work possible.

Corresponding author

Correspondence to Ngo Xuan Bach.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Bach, N.X., Linh, N.D., Phuong, T.M. (2016). Vietnamese POS Tagging for Social Media Text. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds) Neural Information Processing. ICONIP 2016. Lecture Notes in Computer Science, vol 9949. Springer, Cham. https://doi.org/10.1007/978-3-319-46675-0_26

  • DOI: https://doi.org/10.1007/978-3-319-46675-0_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46674-3

  • Online ISBN: 978-3-319-46675-0

  • eBook Packages: Computer Science, Computer Science (R0)
