Abstract
In this paper, we present several approaches to diacritics restoration in Vietnamese, based on letters and on syllables. Experiments with language-specific feature selection are conducted to evaluate the contribution of different feature types. The results show that a combination of AdaBoost and C4.5 using the letter-based feature set achieves 94.7% accuracy, which is competitive with other systems for Vietnamese diacritics restoration. Test data for this task can be collected freely with simple preprocessing, whereas large test sets for many other natural language processing tasks in Vietnamese are lacking. Diacritics restoration could therefore serve as an application-driven evaluation framework for lexical disambiguation tasks.
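The "simple preprocessing" mentioned in the abstract can be illustrated with a minimal Python sketch: stripping diacritics from ordinary Vietnamese text gives the system input, and the original text serves as the gold standard. The second function turns a sentence into letter-level training instances. The function names, the context window of 2, the `_` padding symbol, and the choice of features are all assumptions made for illustration; the abstract does not specify the paper's actual feature set.

```python
import unicodedata

# Base letters that can carry a diacritic in Vietnamese orthography.
AMBIGUOUS = set("aeiouyd")

def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. 'Tiếng Việt' becomes 'Tieng Viet'."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # đ/Đ have no canonical decomposition, so they are mapped by hand.
    return stripped.replace("đ", "d").replace("Đ", "D")

def make_pairs(sentences):
    """Build (diacritic-free input, gold sentence) pairs from raw text."""
    return [(strip_diacritics(s), s) for s in sentences]

def letter_instances(sentence: str, window: int = 2):
    """Yield (context letters, gold letter) for each ambiguous letter.

    The window size and padding symbol are hypothetical choices,
    not the feature set used in the paper.
    """
    gold = unicodedata.normalize("NFC", sentence)
    plain = [strip_diacritics(ch) for ch in gold]
    padded = ["_"] * window + plain + ["_"] * window
    instances = []
    for i, (p, g) in enumerate(zip(plain, gold)):
        if p.lower() in AMBIGUOUS:
            # The slice holds `window` letters on each side plus the letter itself.
            features = tuple(padded[i : i + 2 * window + 1])
            instances.append((features, g))
    return instances
```

Instances of this shape could then be passed to any classifier, for example a boosted decision tree, once the context letters are encoded as categorical features.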
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, KH., Ock, CY. (2010). Diacritics Restoration in Vietnamese: Letter Based vs. Syllable Based Model. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_61
DOI: https://doi.org/10.1007/978-3-642-15246-7_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15245-0
Online ISBN: 978-3-642-15246-7
eBook Packages: Computer Science, Computer Science (R0)