A framework for efficient development of Slovenian written language resources used in speech processing applications

International Journal of Speech Technology

Abstract

This paper presents a framework for the efficient development and representation of morphological and phonetic lexicons for use in speech technology applications. Developing such lexicons requires analyzing which solutions are most appropriate for building speech technologies for a specific language. The paper addresses the development of resources, good word coverage in general texts, efficient coding and representation of lexicons (with respect to time and memory), and the integration of lexicons into speech processing applications. The construction process within the proposed framework is based on finite-state machines and heterogeneous relation graph structures. It significantly reduces the time and effort needed to construct large-scale lexicons, minimizes analysis errors, and represents the lexicons efficiently in terms of time and memory usage. The wordlist construction process presented in the paper also guarantees that the constructed lexicons achieve high word coverage in general texts. SIlex lexicons are large-scale phonetic and morphological lexicons for the Slovenian language, constructed within the new framework using the toolset developed for it; they are valuable language resources for developing various speech processing applications for Slovenian.
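As a rough illustration of why finite-state machines suit the compact representation of inflected word forms, the following minimal sketch builds a trie over a handful of Slovenian word forms and then merges states that accept the same continuations, yielding a smaller acyclic automaton. It is not the authors' toolset: the names `State`, `build_trie`, `minimize`, `count_states`, `accepts`, and the sample list `WORDS` are all hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Tuple


class State:
    """One automaton state: outgoing labelled edges plus a final flag."""

    def __init__(self) -> None:
        self.edges: Dict[str, "State"] = {}
        self.final = False


def build_trie(words: Iterable[str]) -> State:
    """Build a plain trie; shared prefixes are stored once."""
    root = State()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, State())
        node.final = True
    return root


def minimize(root: State) -> State:
    """Merge states with identical continuations (bottom-up, in place)."""
    registry: Dict[Tuple, State] = {}

    def canon(state: State) -> State:
        # Canonicalize children first so equal subtrees share one object.
        for ch, child in list(state.edges.items()):
            state.edges[ch] = canon(child)
        key = (state.final,
               tuple(sorted((ch, id(t)) for ch, t in state.edges.items())))
        return registry.setdefault(key, state)

    return canon(root)


def count_states(root: State) -> int:
    """Count the distinct states reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        state = stack.pop()
        if id(state) in seen:
            continue
        seen.add(id(state))
        stack.extend(state.edges.values())
    return len(seen)


def accepts(root: State, word: str) -> bool:
    """Return True if the automaton contains the given word form."""
    node = root
    for ch in word:
        if ch not in node.edges:
            return False
        node = node.edges[ch]
    return node.final


# Illustrative sample only (assumption): a few inflected forms of two nouns.
WORDS = ["miza", "mize", "mizi", "mizo", "riba", "ribe", "ribi", "ribo"]

if __name__ == "__main__":
    print("trie states:", count_states(build_trie(WORDS)))
    print("minimized states:", count_states(minimize(build_trie(WORDS))))
```

Because inflected forms of different lemmas often share the same endings, merging equivalent states lets the automaton store each ending pattern once, which is the general idea behind memory-efficient finite-state lexicons.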

Author information

Corresponding author

Correspondence to Matej Rojc.

About this article

Cite this article

Rojc, M., Verdonik, D. & Kačič, Z. A framework for efficient development of Slovenian written language resources used in speech processing applications. Int J Speech Technol 10, 121–141 (2007). https://doi.org/10.1007/s10772-009-9032-x

  • DOI: https://doi.org/10.1007/s10772-009-9032-x