Diacritics Restoration in the Slovak Texts Using Hidden Markov Model

  • Conference paper
Human Language Technology. Challenges for Computer Science and Linguistics (LTC 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9561)

Abstract

This paper presents a fast and accurate method for restoring diacritical marks and inferring the intended meaning of a word from its context, based on a hidden Markov model and the Viterbi algorithm. The proposed algorithm may be useful wherever erroneous text can appear, such as in web search engines, e-mail messages, office suites, optical character recognition, or typing on small mobile device keyboards.
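
As an illustration of the general idea only (not the authors' implementation, which is not described on this page), the following minimal sketch treats each word without diacritics as an observation and each diacritized vocabulary word as a hidden state, then uses the Viterbi algorithm to pick the most probable sequence. The vocabulary, the probabilities in `UNIGRAM` and `BIGRAM`, the `floor` smoothing value, and the function names are all hypothetical toy assumptions; the emission model is simplified to "a candidate is allowed only if removing its diacritics yields the observed token".

```python
import math
import unicodedata

# Hypothetical toy model: invented probabilities, for illustration only.
UNIGRAM = {"krajský": 0.2, "súd": 0.3, "sud": 0.1, "piva": 0.1}
BIGRAM = {("krajský", "súd"): 0.9, ("sud", "piva"): 0.8}


def strip_diacritics(word):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def restore(tokens, unigram, bigram, floor=1e-6):
    """Viterbi search over diacritized candidates for each stripped token."""
    if not tokens:
        return []

    # Hidden-state candidates per observation: vocabulary words whose
    # stripped form equals the token. Unknown tokens pass through unchanged.
    candidates = []
    for tok in tokens:
        cands = [w for w in unigram if strip_diacritics(w) == tok]
        candidates.append(cands or [tok])

    # best[w] = (log-probability of the best path ending in w, that path)
    best = {w: (math.log(unigram.get(w, floor)), [w]) for w in candidates[0]}
    for cands in candidates[1:]:
        step = {}
        for w in cands:
            # Unseen bigrams get the small floor probability (crude smoothing).
            score, path = max(
                (prev_score + math.log(bigram.get((prev, w), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[w] = (score, path + [w])
        best = step

    # Return the path with the highest final log-probability.
    return max(best.values())[1]


print(restore(["krajsky", "sud"], UNIGRAM, BIGRAM))  # ['krajský', 'súd']
print(restore(["sud", "piva"], UNIGRAM, BIGRAM))     # ['sud', 'piva']
```

In this toy example the same input token "sud" is resolved to "súd" (court) after "krajský" and left as "sud" (barrel) before "piva", which is the kind of context-dependent choice the abstract describes. A practical system would estimate the probabilities from a large corpus and use proper smoothing rather than a fixed floor.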

Acknowledgement

The research presented in this paper was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the research project VEGA 1/0386/12 (50 %) and by the Research and Development Operational Program funded by the ERDF under the project ITMS-26220220141 (50 %).

Author information

Corresponding author

Correspondence to Daniel Hládek.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Hládek, D., Staš, J., Juhár, J. (2016). Diacritics Restoration in the Slovak Texts Using Hidden Markov Model. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_3

  • DOI: https://doi.org/10.1007/978-3-319-43808-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43807-8

  • Online ISBN: 978-3-319-43808-5

  • eBook Packages: Computer Science, Computer Science (R0)
