Abstract
This paper presents a fast and accurate method for restoring diacritical marks and inferring the originally intended word from its context, based on a hidden Markov model and the Viterbi algorithm. The proposed algorithm may be useful wherever erroneous text can appear, for example in web search engines, e-mail messages, office suites, optical character recognition, or typing on small mobile-device keyboards.
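The abstract does not give the model's details, so the following is only a minimal sketch of how diacritics restoration can be framed as hidden Markov model decoding with the Viterbi algorithm. It assumes a word-level model in which the hidden states are diacritized words, the observations are their diacritic-free forms, and the transitions come from smoothed bigram counts. The lexicon, the counts, the smoothing constant `K`, and all function names are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
import unicodedata
from collections import defaultdict

# Hypothetical toy counts (NOT from the paper); a real system would
# estimate them from a large diacritized training corpus.
UNIGRAMS = {"sud": 3, "súd": 5, "rozhodol": 4, "je": 6,
            "plný": 3, "vína": 3, "vina": 4}
BIGRAMS = {("súd", "rozhodol"): 4, ("sud", "je"): 2, ("je", "plný"): 3,
           ("plný", "vína"): 3, ("vina", "je"): 2}
K = 0.1                         # add-k smoothing constant (assumption)
V = len(UNIGRAMS)               # vocabulary size
N = sum(UNIGRAMS.values())      # total number of word tokens


def strip_diacritics(text):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Map each diacritic-free form to the diacritized words that produce it.
CANDIDATES = defaultdict(list)
for lexicon_word in UNIGRAMS:
    CANDIDATES[strip_diacritics(lexicon_word)].append(lexicon_word)


def log_unigram(word):
    return math.log((UNIGRAMS.get(word, 0) + K) / (N + K * V))


def log_bigram(prev, word):
    count = BIGRAMS.get((prev, word), 0)
    return math.log((count + K) / (UNIGRAMS.get(prev, 0) + K * V))


def restore(tokens):
    """Viterbi search for the most probable diacritized word sequence.

    Tokens are assumed to be lower-cased. The emission probability is
    treated as 1 for every candidate whose stripped form matches the
    token, so only the transition scores differ between paths.
    """
    if not tokens:
        return []
    # Unknown words are kept unchanged as their own single candidate.
    layers = [CANDIDATES.get(strip_diacritics(t), [t]) for t in tokens]
    # Each trellis column maps a word to (best log-score, back-pointer).
    trellis = [{w: (log_unigram(w), None) for w in layers[0]}]
    for words in layers[1:]:
        prev = trellis[-1]
        column = {}
        for w in words:
            column[w] = max((prev[p][0] + log_bigram(p, w), p) for p in prev)
        trellis.append(column)
    # Backtrack from the best final state.
    word = max(trellis[-1], key=lambda w: trellis[-1][w][0])
    path = [word]
    for column in reversed(trellis[1:]):
        word = column[word][1]
        path.append(word)
    return path[::-1]


print(restore("sud rozhodol".split()))      # ['súd', 'rozhodol']
print(restore("sud je plny vina".split()))  # ['sud', 'je', 'plný', 'vína']
```

With these toy counts the ambiguous token "sud" is resolved by its context: it becomes "súd" (court) before "rozhodol" but stays "sud" (barrel) before "je plný vína". A production system would need a far larger lexicon, handling of case and punctuation, and a more careful treatment of out-of-vocabulary words.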
Acknowledgement
The research presented in this paper was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the research project VEGA 1/0386/12 (50 %) and by the Research and Development Operational Program funded by the ERDF under the project ITMS-26220220141 (50 %).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Hládek, D., Staš, J., Juhár, J. (2016). Diacritics Restoration in the Slovak Texts Using Hidden Markov Model. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_3
DOI: https://doi.org/10.1007/978-3-319-43808-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43807-8
Online ISBN: 978-3-319-43808-5
eBook Packages: Computer Science, Computer Science (R0)