A framework for efficient development of Slovenian written language resources used in speech processing applications

Rojc, Matej; Verdonik, Darinka; Kačič, Zdravko

doi:10.1007/s10772-009-9032-x

Published: 07 May 2009

Volume 10, pages 121–141 (2007)

International Journal of Speech Technology

Abstract

This paper presents a framework for the efficient development and representation of morphological and phonetic lexicons for use in speech technology applications. Developing such lexicons requires analyzing which solutions are most appropriate for building speech technologies for a specific language. The paper addresses the development of resources, good word coverage in general texts, efficient coding and representation of lexicons (with regard to time and memory space), and the integration of lexicons into speech processing applications. The construction process within the proposed framework is based on finite-state machines and heterogeneous relation graph structures. It significantly reduces the time and effort needed to construct large-scale lexicons, minimizes analysis errors, and represents the lexicons efficiently with regard to time and memory usage. The wordlist construction process presented in the paper also guarantees that the constructed lexicons achieve high word coverage in general texts. The SIlex lexicons are large-scale phonetic and morphological lexicons for the Slovenian language, constructed within the new framework with a toolset developed for this purpose. They are valuable language resources for developing various speech processing applications for Slovenian.
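The abstract names finite-state machines as the basis for compact lexicon representation and stresses word coverage in general texts. The following is a minimal sketch of those two ideas, not the authors' toolset, which is not described in this preview. It assumes a plain wordlist, builds a trie, merges equivalent states into a minimal acyclic automaton, and measures coverage as the fraction of tokens found in the lexicon. All names (`State`, `build_trie`, `minimize`, `coverage`) and the sample words are hypothetical.

```python
from typing import Dict, Iterable, Tuple


class State:
    """One automaton state: outgoing transitions plus an end-of-word flag."""

    def __init__(self) -> None:
        self.edges: Dict[str, "State"] = {}
        self.final = False


def build_trie(words: Iterable[str]) -> State:
    """Build a prefix tree (an unminimized acyclic automaton) from a wordlist."""
    root = State()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, State())
        node.final = True
    return root


def minimize(root: State) -> State:
    """Merge states that accept the same set of suffixes (bottom-up, in place)."""
    registry: Dict[Tuple, State] = {}

    def canon(node: State) -> State:
        # Canonicalize children first so equal subtrees share one object.
        for ch in list(node.edges):
            node.edges[ch] = canon(node.edges[ch])
        key = (node.final,
               tuple(sorted((ch, id(t)) for ch, t in node.edges.items())))
        return registry.setdefault(key, node)

    return canon(root)


def count_states(root: State) -> int:
    """Count distinct states reachable from the root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def accepts(root: State, word: str) -> bool:
    """Return True if the automaton contains the word."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def coverage(root: State, tokens: Iterable[str]) -> float:
    """Fraction of tokens that are present in the lexicon."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(accepts(root, t) for t in tokens) / len(tokens)


# Illustrative inflected forms only; not taken from the SIlex lexicons.
WORDS = ["hiša", "hiše", "hiši", "miza", "mize", "mizi"]
```

Inflected forms that share endings collapse onto shared states after minimization, which is why this kind of representation tends to suit a morphologically rich language. Lookup time stays proportional to word length regardless of lexicon size.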

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, 2000, Maribor, Slovenia
Matej Rojc, Darinka Verdonik & Zdravko Kačič

Corresponding author

Correspondence to Matej Rojc.

About this article

Cite this article

Rojc, M., Verdonik, D. & Kačič, Z. A framework for efficient development of Slovenian written language resources used in speech processing applications. Int J Speech Technol 10, 121–141 (2007). https://doi.org/10.1007/s10772-009-9032-x

Received: 27 February 2006
Accepted: 22 April 2009
Published: 07 May 2009
Issue Date: September 2007
DOI: https://doi.org/10.1007/s10772-009-9032-x
