Knowledge Transfer via Word Alignment and Its Application to Vietnamese POS Tagging

Do, Hao D.

doi:10.1007/978-3-031-26303-3_8

Hao D. Do ORCID: orcid.org/0000-0002-9014-1506

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13831)

Included in the following conference series:

International Conference on Computational Data and Social Networks

Abstract

Building a linguistic tagger is not difficult when a large annotated corpus is available. For low-resource languages such as Vietnamese, however, labeled data is a major obstacle: because of limited development and investment in research, no large, high-accuracy annotated corpus exists. This paper proposes a transfer learning strategy to build a high-quality tagger for Vietnamese using a Vietnamese-English bilingual corpus. In particular, we inherit the strength of an English POS tagger, which is built from a large-scale corpus, and transfer its knowledge to the Vietnamese POS tagger via word alignment. Experimental results show that the proposed method achieves an accuracy of 94.97%. The transfer strategy depends mostly on the accuracy of the source tagger and the performance of the word alignment process, so it can be extended to other low-resource languages as long as a large bilingual corpus paired with a resource-rich language is available.
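
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of projecting POS tags across word-alignment links, not the author's actual method. The function name `project_tags`, the placeholder tag `X` for unaligned tokens, the majority-vote rule, and the toy sentence pair with its hand-written alignment are all assumptions made for illustration.

```python
from collections import Counter


def project_tags(src_tags, alignment, tgt_len, unknown="X"):
    """Project source-side POS tags onto target tokens via alignment links.

    src_tags:  list of tags, one per source (English) token
    alignment: iterable of (src_index, tgt_index) pairs
    tgt_len:   number of target (Vietnamese) tokens
    unknown:   hypothetical placeholder tag for unaligned target tokens
    """
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignment:
        votes[t][src_tags[s]] += 1
    # Majority vote per target token; ties go to the tag seen first.
    return [v.most_common(1)[0][0] if v else unknown for v in votes]


# Toy example: tags and alignment are written by hand, not produced by
# a real tagger or aligner.
english = ["I", "love", "cats"]
english_tags = ["PRON", "VERB", "NOUN"]
vietnamese = ["Tôi", "yêu", "mèo"]
alignment = [(0, 0), (1, 1), (2, 2)]

projected = project_tags(english_tags, alignment, len(vietnamese))
print(list(zip(vietnamese, projected)))
```

In a sketch like this, the quality of the projected tags is bounded by the two inputs the abstract names: errors in the source tags and errors in the alignment links both propagate directly to the target side.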

Acknowledgement

Hao D. Do was funded by Vingroup JSC and supported by the Ph.D. Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, code VINIF.2021.TS.120. The author would like to thank the Computational Linguistics Center (Lab C44), University of Science - VNUHCM, for their support.

Author information

Authors and Affiliations

FPT University, Ho Chi Minh City, Vietnam
Hao D. Do

Corresponding author

Correspondence to Hao D. Do.

Editor information

Editors and Affiliations

Virginia Commonwealth University, Richmond, VA, USA
Thang N. Dinh
Yeung Kin Man Academic Building, City University Hong Kong, Kowloon Tong, Hong Kong
Minming Li

About this paper

Cite this paper

Do, H.D. (2023). Knowledge Transfer via Word Alignment and Its Application to Vietnamese POS Tagging. In: Dinh, T.N., Li, M. (eds) Computational Data and Social Networks. CSoNet 2022. Lecture Notes in Computer Science, vol 13831. Springer, Cham. https://doi.org/10.1007/978-3-031-26303-3_8

DOI: https://doi.org/10.1007/978-3-031-26303-3_8
Published: 11 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26302-6
Online ISBN: 978-3-031-26303-3
eBook Packages: Computer Science, Computer Science (R0)
