Knowledge Transfer via Word Alignment and Its Application to Vietnamese POS Tagging

  • Conference paper
Computational Data and Social Networks (CSoNet 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13831)

Abstract

Building a linguistic tagger is not difficult when a large annotated corpus is available. For low-resource languages such as Vietnamese, however, labeled data is a major obstacle: because research development and investment have been limited, no large, high-accuracy annotated corpus exists. This paper proposes a transfer learning strategy for building a high-quality Vietnamese tagger from a Vietnamese-English bilingual corpus. Specifically, we inherit the strength of an English POS tagger, which is built from a large-scale corpus, and transfer its knowledge to the Vietnamese POS tagger via word alignment. Experimental results show that the proposed method achieves an accuracy of \(94.97\%\). Because the transfer strategy depends mostly on the accuracy of the source tagger and the performance of the word alignment process, it can be extended to other low-resource languages, provided that a large bilingual corpus paired with a resource-rich language is available.
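
The core idea of transferring tags across a word alignment can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `parse_alignment` and `project_tags`, the "i-j" alignment format, the coarse tag names, the majority-vote rule for many-to-one links, and the fallback tag "X" for unaligned tokens are all assumptions made for illustration.

```python
from collections import Counter


def parse_alignment(line):
    """Parse 'i-j' pairs (source index - target index) into integer tuples.

    The 'i-j' text format is an assumption; the abstract does not say how
    alignments are stored.
    """
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Project source-side POS tags onto target tokens via word alignment.

    src_tags:  POS tags, one per source (English) token
    alignment: iterable of (source_index, target_index) pairs
    tgt_len:   number of target (Vietnamese) tokens
    default:   hypothetical fallback tag for unaligned target tokens
    """
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignment:
        votes[t][src_tags[s]] += 1
    # Majority vote per target token (an assumed rule, not the paper's).
    return [v.most_common(1)[0][0] if v else default for v in votes]


if __name__ == "__main__":
    english = ["I", "love", "books"]
    english_tags = ["PRON", "VERB", "NOUN"]  # output of a source-side tagger
    vietnamese = ["Tôi", "yêu", "sách"]
    links = parse_alignment("0-0 1-1 2-2")
    for token, tag in zip(vietnamese, project_tags(english_tags, links, len(vietnamese))):
        print(token, tag)
```

In this toy setting every Vietnamese token receives the tag of the English token it is aligned to, which makes the dependence stated in the abstract concrete: an error in either the source tagger or the alignment propagates directly to the projected tag.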

Acknowledgement

Hao D. Do was funded by Vingroup JSC and supported by the Ph.D. Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, code VINIF.2021.TS.120. The author would like to thank the Computational Linguistics Center (Lab C44), University of Science - VNUHCM, for their support.

Author information

Corresponding author

Correspondence to Hao D. Do.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Do, H.D. (2023). Knowledge Transfer via Word Alignment and Its Application to Vietnamese POS Tagging. In: Dinh, T.N., Li, M. (eds) Computational Data and Social Networks. CSoNet 2022. Lecture Notes in Computer Science, vol 13831. Springer, Cham. https://doi.org/10.1007/978-3-031-26303-3_8

  • DOI: https://doi.org/10.1007/978-3-031-26303-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26302-6

  • Online ISBN: 978-3-031-26303-3

  • eBook Packages: Computer Science (R0)
