Investigating Word Segmentation Techniques for German Using Finite-State Transducers

Pintér, Gábor; Schielke, Mira; Petrick, Rico

doi:10.1007/978-3-319-99579-3_53

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Word segmentation plays an important role in speech recognition as a text pre-processing step that helps reduce out-of-vocabulary items and lowers language model perplexity. Segmentation is applied mainly to agglutinative languages, but other morphologically rich languages, such as German, can also benefit from it. Using a relatively small, manually collected broadcast corpus of 134k tokens, the current study investigates how Finite-State Transducers (FSTs) can be applied to word segmentation in German. FSTs incorporating word-formation rules reach high segmentation performance, with a precision of 0.97 and a recall of 0.93. FSTs incorporating n-gram models of manually segmented data reach even higher performance, with accuracy and recall rates of 0.97. This result is remarkable given that the bottom-up approach performs on par with the expert system without requiring explicit knowledge of morphological categories or word-formation rules.
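The precision and recall figures quoted in the abstract can be made concrete with a small scoring sketch. The abstract does not say how the scores were computed, so the boundary-level scoring below is an assumption, and the function names `boundaries` and `precision_recall` are hypothetical helpers, not part of the paper. Each segmentation is written with `+` between morphemes.

```python
def boundaries(segmented: str) -> set[int]:
    """Character offsets in the unsegmented word at which a '+' boundary falls."""
    offsets = set()
    pos = 0
    # The last morph has no boundary after it, so it is skipped.
    for morph in segmented.split("+")[:-1]:
        pos += len(morph)
        offsets.add(pos)
    return offsets


def precision_recall(gold: list[str], predicted: list[str]) -> tuple[float, float]:
    """Boundary-level precision and recall over aligned gold/predicted words.

    Assumption: scoring is done per boundary; the paper may score differently.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gold_b, pred_b = boundaries(g), boundaries(p)
        tp += len(gold_b & pred_b)   # boundaries found in both
        fp += len(pred_b - gold_b)   # spurious predicted boundaries
        fn += len(gold_b - pred_b)   # gold boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


if __name__ == "__main__":
    gold = ["zeit+raum", "tories"]
    predicted = ["zeit+raum", "tor+ie+s"]
    print(precision_recall(gold, predicted))
```

In this toy input, the one correct boundary in `zeit+raum` is found, while the two spurious boundaries in `tor+ie+s` lower precision without affecting recall.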

Notes

1. Words are lowercased for clarity. In German, nouns start with capital letters, so the segmentation would more correctly be written as Zeitraum $\rightarrow$ Zeit+Raum.
2. However, a few OOV words were falsely segmented into (typically short) morphemes, leading to errors (e.g. Tories $\rightarrow$ Tor+ie+s).
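The over-segmentation described in note 2 can be reproduced with a toy lexicon-driven splitter. This is a minimal sketch, not the authors' FST: it finds the lowest-cost path through a lattice of lexicon morphs by dynamic programming, which is analogous to a shortest-path search over a weighted transducer. The lexicon entries and their costs are hypothetical values chosen for illustration, and words are lowercased as in note 1.

```python
import math

# Hypothetical toy lexicon: morph -> cost (think negative log probability).
# None of these values come from the paper.
LEXICON = {"zeit": 2.0, "raum": 2.5, "tor": 3.0, "ie": 5.0, "s": 4.0}


def segment(word, lexicon=LEXICON):
    """Return the lowest-cost split of `word` into lexicon morphs, or None."""
    n = len(word)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of segmenting word[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last morph in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            morph = word[start:end]
            if morph in lexicon and best[start] + lexicon[morph] < best[end]:
                best[end] = best[start] + lexicon[morph]
                back[end] = start
    if math.isinf(best[n]):
        return None  # no sequence of lexicon morphs covers the whole word
    morphs = []
    i = n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return "+".join(reversed(morphs))


if __name__ == "__main__":
    for w in ["zeitraum", "tories", "haus"]:
        print(w, "->", segment(w))
```

Because the short morphs `ie` and `s` are in the lexicon, the out-of-vocabulary word `tories` is covered by `tor+ie+s` even though that split is linguistically wrong. A word with no covering path, such as `haus` here, is left unsegmented.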

Author information

Authors and Affiliations

Linguwerk GmbH, 01069, Dresden, Germany
Gábor Pintér, Mira Schielke & Rico Petrick

Corresponding author

Correspondence to Gábor Pintér.

Editor information

Editors and Affiliations

SPIIRAS, St. Petersburg, Russia
Alexey Karpov
Leipzig University of Telecommunications, Leipzig, Germany
Oliver Jokisch
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Pintér, G., Schielke, M., Petrick, R. (2018). Investigating Word Segmentation Techniques for German Using Finite-State Transducers. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_53

DOI: https://doi.org/10.1007/978-3-319-99579-3_53
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science (R0)
