Abstract
Building a linguistic tagger is not difficult when a large annotated corpus is available. For low-resource languages such as Vietnamese, however, labeled data is a major problem: because of limited development and research investment, no large, high-accuracy annotated corpus exists. This paper proposes a transfer learning strategy to build a high-quality tagger for Vietnamese from a Vietnamese-English bilingual corpus. Specifically, we inherit the strength of an English POS tagger, which is built from a large-scale corpus, and transfer its knowledge to the Vietnamese POS tagger via word alignment. Experimental results show that the proposed method achieves an accuracy of \(94.97\%\). Because the transfer strategy depends mostly on the accuracy of the source tagger and the performance of the word alignment process, it can be extended to other low-resource languages as long as a large bilingual corpus with a resource-rich language is available.
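To make the idea of transferring tags through word alignment concrete, the following is a minimal sketch of tag projection, not the method as implemented in the paper. It assumes that the English side has already been tagged, that alignments are given as 0-based `source-target` index pairs (the common Pharaoh text format), and that each Vietnamese token takes the majority tag of the English tokens aligned to it. The function names `project_tags` and `parse_pharaoh`, the placeholder tag `X`, and the example sentence with its tags are all hypothetical choices made for illustration.

```python
from collections import Counter


def parse_pharaoh(line):
    """Parse an alignment string such as "0-0 1-2" into (src, tgt) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Project source-side POS tags onto target tokens through word alignment.

    src_tags  : list of POS tags, one per source (English) token
    alignment : iterable of (src_index, tgt_index) pairs, 0-based
    tgt_len   : number of target (Vietnamese) tokens
    default   : hypothetical placeholder tag for unaligned target tokens
    """
    # One vote counter per target token.
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignment:
        # Ignore alignment links that point outside either sentence.
        if 0 <= s < len(src_tags) and 0 <= t < tgt_len:
            votes[t][src_tags[s]] += 1
    # Majority vote per target token; unaligned tokens get the default tag.
    return [v.most_common(1)[0][0] if v else default for v in votes]


# Illustrative sentence pair (assumed, not taken from the paper's data).
en_tags = ["PRON", "VERB", "NOUN"]      # tags for: I read books
vi_tokens = ["Tôi", "đọc", "sách"]
alignment = parse_pharaoh("0-0 1-1 2-2")

vi_tags = project_tags(en_tags, alignment, len(vi_tokens))
print(list(zip(vi_tokens, vi_tags)))
```

In a sketch like this, the quality of the projected tags is bounded by both the source tagger and the aligner, which matches the dependency the abstract describes: a wrong English tag or a wrong alignment link is carried straight into the Vietnamese annotation.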
Acknowledgement
Hao D. Do was funded by Vingroup JSC and supported by the Ph.D. Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, code VINIF.2021.TS.120. The author would like to thank the Computational Linguistics Center (Lab C44), University of Science - VNUHCM, for their support.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Do, H.D. (2023). Knowledge Transfer via Word Alignment and Its Application to Vietnamese POS Tagging. In: Dinh, T.N., Li, M. (eds) Computational Data and Social Networks. CSoNet 2022. Lecture Notes in Computer Science, vol 13831. Springer, Cham. https://doi.org/10.1007/978-3-031-26303-3_8
DOI: https://doi.org/10.1007/978-3-031-26303-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26302-6
Online ISBN: 978-3-031-26303-3
eBook Packages: Computer Science, Computer Science (R0)