Diacritics Restoration in Vietnamese: Letter Based vs. Syllable Based Model

Nguyen, Kiem-Hieu; Ock, Cheol-Young

doi:10.1007/978-3-642-15246-7_61


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6230)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence


Abstract

In this paper, we present several approaches to diacritics restoration in Vietnamese, based on letters and on syllables. Experiments with language-specific feature selection are conducted to evaluate the contribution of different types of features. The results show that a combination of AdaBoost and C4.5 using a letter-based feature set achieves 94.7% accuracy, which is competitive with other systems for diacritics restoration in Vietnamese. Test data for the diacritics restoration task in Vietnamese can be collected freely with simple preprocessing, whereas large test sets are lacking for many other natural language processing tasks in Vietnamese. Diacritics restoration could therefore serve as an application-driven evaluation framework for lexical disambiguation tasks.
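The "simple preprocessing" mentioned in the abstract can be illustrated with a minimal sketch: strip the diacritics from ordinary Vietnamese text and keep the original as the gold standard. This is an assumption about what such preprocessing might look like, not the authors' pipeline. The function names `strip_diacritics` and `letter_examples`, the padding symbol `_`, and the context window size are all hypothetical, and the paper's actual letter-based feature set is not described on this page.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. map 'Việt' to 'Viet'."""
    # 'đ' and 'Đ' have no canonical decomposition, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category Mn) and keep the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def letter_examples(text: str, window: int = 2) -> list[tuple[str, str]]:
    """Build (context, target) pairs for a letter-level classifier.

    Illustrative assumption: the context is the stripped letter plus
    `window` stripped letters on each side, and the target is the
    original letter with its diacritics.
    """
    gold = unicodedata.normalize("NFC", text)
    plain = strip_diacritics(gold)
    # Assumes every NFC character strips to exactly one character,
    # which holds for precomposed Vietnamese letters.
    if len(plain) != len(gold):
        raise ValueError("input contains characters without a one-to-one mapping")
    padded = "_" * window + plain + "_" * window
    return [
        (padded[i : i + 2 * window + 1], target)
        for i, target in enumerate(gold)
    ]


if __name__ == "__main__":
    sentence = "Tiếng Việt"
    print(strip_diacritics(sentence))
    for context, target in letter_examples(sentence, window=1):
        print(repr(context), "->", repr(target))
```

Because the stripped text is derived mechanically from the original, any diacritized Vietnamese corpus yields labeled data without manual annotation, which is the property the abstract relies on when proposing the task as an evaluation framework.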

Author information

Authors and Affiliations

Natural Language Processing Lab, School of Computer Engineering and Information Technology, University of Ulsan, Korea
Kiem-Hieu Nguyen & Cheol-Young Ock

Editor information

Editors and Affiliations

School of Computer Science and Engineering, Seoul National University, 151-744, Seoul, Korea
Byoung-Tak Zhang
Department of Computing, Macquarie University, NSW, Sydney, Australia
Mehmet A. Orgun

About this paper

Cite this paper

Nguyen, KH., Ock, CY. (2010). Diacritics Restoration in Vietnamese: Letter Based vs. Syllable Based Model. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_61

DOI: https://doi.org/10.1007/978-3-642-15246-7_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15245-0
Online ISBN: 978-3-642-15246-7
eBook Packages: Computer Science, Computer Science (R0)
