Abstract
Word segmentation plays an important role in speech recognition as a text pre-processing step that reduces the number of out-of-vocabulary items and lowers language model perplexity. Segmentation is applied mainly to agglutinative languages, but other morphologically rich languages, such as German, can also benefit from it. Using a relatively small, manually collected broadcast corpus of 134k tokens, the current study investigates how Finite-State Transducers (FSTs) can be applied to word segmentation in German. It is shown that FSTs incorporating word-formation rules can reach high segmentation performance, with a precision of 0.97 and a recall of 0.93. It is also shown that FSTs incorporating n-gram models of manually segmented data can reach even higher performance, with accuracy and recall rates of 0.97. This result is remarkable because the bottom-up approach performs on par with the expert system without requiring explicit knowledge of morphological categories or word-formation rules.
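To make the reported precision and recall figures concrete, the following is a minimal sketch of how segmentation output could be scored at the level of morpheme boundaries. The abstract does not state how the scores were computed, so boundary-level scoring is an assumption here, and the helper names `boundaries` and `precision_recall` are hypothetical. The examples "zeit+raum" and "tor+ie+s" come from the notes of the paper; "haus+tür+schlüssel" is an invented illustration.

```python
def boundaries(segmented: str) -> set[int]:
    """Return the character offsets (in the unsegmented word) of each '+' boundary."""
    positions = set()
    offset = 0
    # Every morpheme except the last one is followed by a boundary.
    for morph in segmented.split("+")[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(gold: list[str], predicted: list[str]) -> tuple[float, float]:
    """Boundary-level precision and recall over paired gold/predicted segmentations."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gold_b, pred_b = boundaries(g), boundaries(p)
        tp += len(gold_b & pred_b)   # boundaries found correctly
        fp += len(pred_b - gold_b)   # boundaries inserted wrongly
        fn += len(gold_b - pred_b)   # boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Assumed toy data: one correct split, one over-segmented OOV word,
# and one compound with a missed boundary.
gold = ["zeit+raum", "tories", "haus+tür+schlüssel"]
predicted = ["zeit+raum", "tor+ie+s", "haus+türschlüssel"]
precision, recall = precision_recall(gold, predicted)
```

In this toy setting, over-segmentation of an unknown word lowers precision, while a missed compound boundary lowers recall.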
Notes
1. Words are lowercased for clarity. In German, nouns are capitalized, so the segmentation would more correctly be written Zeitraum \(\rightarrow\) Zeit+Raum.
2. However, a few OOV words were falsely segmented into morphemes, typically short ones, leading to errors (e.g. Tories \(\rightarrow\) Tor+ie+s).
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Pintér, G., Schielke, M., Petrick, R. (2018). Investigating Word Segmentation Techniques for German Using Finite-State Transducers. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_53
DOI: https://doi.org/10.1007/978-3-319-99579-3_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)