Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language

Eger, Steffen

doi:10.1007/978-3-319-23980-4_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 537)

Included in the following conference series:

International Workshop on Systems and Frameworks for Computational Morphology

Abstract

We consider the statistical lemmatization problem in which lemmatizers are trained on (word form, lemma) pairs. In particular, we consider this problem for ancient Latin, a language with a high degree of morphological variability. We investigate whether general-purpose string-to-string transduction models are suitable for this task, and find that they typically perform (much) better than more restricted lemmatization techniques and heuristics based on suffix transformations. We also experimentally test whether string transduction systems that perform well on one string-to-string translation task (here, G2P) perform well on another (here, lemmatization) and vice versa, and find that a joint n-gram model performs better on G2P than a discriminative model of our own making but that this relationship is reversed for lemmatization. Finally, we investigate how the learned lemmatizers can complement lexicon-based systems, e.g., by tackling the OOV and/or the disambiguation problem.
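To make the notion of a lemmatizer "based on suffix transformations" concrete, the following is a minimal sketch of such a restricted baseline learned from (word form, lemma) pairs. It is an illustration only and not one of the systems evaluated in the paper: the function names `suffix_rule`, `train`, and `lemmatize`, the `max_suffix` parameter, and the toy Latin training pairs are all assumptions introduced here.

```python
from collections import Counter, defaultdict

def suffix_rule(form, lemma):
    """Return (strip, add): the form suffix to remove and the lemma suffix to append."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]

def train(pairs, max_suffix=4):
    """For each word-final character n-gram, count the rules it co-occurs with."""
    table = defaultdict(Counter)
    for form, lemma in pairs:
        rule = suffix_rule(form, lemma)
        for k in range(1, min(max_suffix, len(form)) + 1):
            table[form[-k:]][rule] += 1
    return table

def lemmatize(form, table, max_suffix=4):
    """Apply the most frequent applicable rule of the longest known suffix."""
    for k in range(min(max_suffix, len(form)), 0, -1):
        rules = table.get(form[-k:])
        if not rules:
            continue
        for (strip, add), _count in rules.most_common():
            if form.endswith(strip):
                return form[: len(form) - len(strip)] + add
    return form  # no known suffix: fall back to the identity mapping

# Toy training pairs, invented for illustration (not data from the paper).
pairs = [("amicorum", "amicus"), ("dominorum", "dominus"),
         ("puellam", "puella"), ("rosam", "rosa")]
table = train(pairs)
print(lemmatize("servorum", table))  # servus
print(lemmatize("stellam", table))   # stella
```

A model of this kind can only rewrite the end of a word, which is the restriction that general string-to-string transducers lift: they may also change prefixes or word-internal material.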

Notes

1. In stemming, all that typically matters is that related words map to the same (linguistic or even non-linguistic) object.
2. See, e.g., https://prepro.hucompute.org/.
3. In our experiments below, we choose an n-gram order of 6 for Phonetisaurus. Increasing the n-gram order did not lead to better performance in preliminary tests.
4. We use the alignments produced by the Phonetisaurus toolkit.
5. Although CRFs are rather old and not always the best-performing sequence labeling models [17], we use them here mainly for practical reasons. In particular, the CRF package we are using, available from https://code.google.com/p/crfpp/, provides a very convenient interface for modeling sequence labeling.
6. Increasing the window size typically does not lead to better performance, as we verified in preliminary experiments.
7. Typically, word forms in other word classes are also not inflectional, so that the learning problem would be trivial.
8. In fact, it seems that Mate simply stores input strings that occur fewer than 5 times, rather than learning substitution patterns from these (personal communication with Bernd Bohnet). Thus, the evaluation scenario adopted in this work puts Mate at a general disadvantage, since we generally train systems on arbitrary lists of word pairs selected from a lexicon rather than on the distributions found in ‘real’ text.
9. E.g., when the lemmatizer is developed to assist a lexicon-based lemmatizer.
10. We also tried the alternative decoding strategy in which lemmatizers are trained separately, but found it to perform worse.
11. Available at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
12. We could not use Perseus because the TreeTagger was trained on Perseus.
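The notes on alignments, CRFs, and window size suggest treating lemmatization as character-level sequence labeling: each aligned input character is labeled with the output substring it maps to, and the labeler sees a fixed window of neighboring characters. The sketch below shows one way such training rows could be built. It is a hedged illustration, not the paper's implementation: the helper names, the padding symbols `<s>` and `</s>`, the `_` symbol for an empty output, and the hand-made alignment are all assumptions (in the paper, alignments come from the Phonetisaurus toolkit).

```python
def window_features(chars, i, window=2):
    """Features for position i: the characters at offsets -window..+window."""
    padded = ["<s>"] * window + list(chars) + ["</s>"] * window
    center = i + window
    return {f"c[{off}]": padded[center + off]
            for off in range(-window, window + 1)}

def to_training_rows(aligned_pair, window=2):
    """Turn one aligned (input char, output substring) sequence into
    (feature dict, label) rows for a sequence labeler."""
    chars = [src for src, _tgt in aligned_pair]
    return [(window_features(chars, i, window), tgt)
            for i, (_src, tgt) in enumerate(aligned_pair)]

# Hypothetical hand-made alignment of "amicorum" -> "amicus";
# "_" marks an input character that maps to nothing.
aligned = [("a", "a"), ("m", "m"), ("i", "i"), ("c", "c"),
           ("o", "u"), ("r", "s"), ("u", "_"), ("m", "_")]
rows = to_training_rows(aligned, window=2)
features, label = rows[4]
print(features, label)
```

A larger `window` exposes more context per position at the cost of sparser features, which is consistent with the note that wider windows did not help in preliminary experiments.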

References

1. Bartlett, S., Kondrak, G., Cherry, C.: Automatic syllabification with structured SVMs for letter-to-phoneme conversion. In: McKeown, K., Moore, J.D., Teufel, S., Allan, J., Furui, S. (eds.) ACL, pp. 568–576. Association for Computational Linguistics, Morristown (2008)
2. Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)
3. Bohnet, B.: Top accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Coling 2010 Organizing Committee, Beijing, China, pp. 89–97, August 2010. http://www.aclweb.org/anthology/C10-1011
4. Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL 2000, pp. 286–293. Association for Computational Linguistics, Stroudsburg (2000)
5. Daelemans, W., Groenewald, H.J., Huyssteen, G.B.V.: Prototype-based active learning for lemmatization. In: Angelova, G., Bontcheva, K., Mitkov, R., Nicolov, N., Nikolov, N. (eds.) RANLP, pp. 65–70. RANLP 2009 Organising Committee/ACL, Morristown (2009)
6. Dreyer, M., Smith, J., Eisner, J.: Latent-variable modeling of string transductions with finite-state methods. In: EMNLP, pp. 1080–1089. ACL (2008)
7. Eger, S.: Sequence segmentation by enumeration: an exploration. Prague Bull. Math. Linguist. 100, 113–132 (2013)
8. Eger, S., vor der Brück, T., Mehler, A.: Lexicon-assisted tagging and lemmatization in Latin: a comparison of six taggers and two lemmatization methods. In: Latech 2015. Association for Computational Linguistics (2015, accepted)
9. Gesmundo, A., Samardzic, T.: Lemmatisation as a tagging task. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (vol. 2: Short Papers), pp. 368–372. Association for Computational Linguistics (2012). http://aclweb.org/anthology/P12-2072
10. Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT, pp. 905–913. Association for Computational Linguistics, Columbus, June 2008. http://www.aclweb.org/anthology/P/P08/P08-1103
11. Jiampojamarn, S., Cherry, C., Kondrak, G.: Integrating joint n-gram features into a discriminative training framework. In: NAACL-HLT, pp. 697–700. Association for Computational Linguistics (2010)
12. Juršič, M., Mozetič, I., Lavrač, N.: Learning ripple down rules for efficient lemmatization. In: Mladenić, D., Grobelnik, M. (eds.) Proceedings of the 10th International Multiconference Information Society, pp. 206–209. IJS, Ljubljana (2007)
13. Juršič, M., Mozetič, I., Lavrač, N.: LemmaGen: multilingual lemmatisation with induced ripple-down rules. J. Univ. Comput. Sci. 16, 1190–1214 (2010)
14. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001)
15. Mehler, A., vor der Brück, T., Gleim, R., Geelhaar, T.: Towards a network model of the coreness of texts: an experiment in classifying Latin texts using the ttlab Latin tagger. In: Biemann, C., Mehler, A. (eds.) Text Mining: From Ontology Learning to Automated Text Processing Applications. Theory and Applications of Natural Language Processing, pp. 87–112. Springer, Berlin (2015)
16. Migne, J.P. (ed.): Patrologiae Cursus Completus: Series Latina, vol. 1–221. Chadwyck-Healey, Cambridge (1844–1855)
17. Nguyen, N., Guo, Y.: Comparisons of sequence labeling algorithms and extensions. In: Ghahramani, Z. (ed.) ICML. ACM International Conference Proceeding Series, vol. 227, pp. 681–688. ACM, New York (2007)
18. Novak, J.R., Minematsu, N., Hirose, K.: WFST-based grapheme-to-phoneme conversion: open source tools for alignment, model-building and decoding. In: Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pp. 45–49. Association for Computational Linguistics, Donostia-San Sebastián, July 2012. http://www.aclweb.org/anthology/W12-6208
19. Porter, M.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst. 14(3), 130–137 (1980)
20. Richmond, K., Clark, R.A.J., Fitt, S.: Robust LTS rules with the Combilex speech technology lexicon. In: INTERSPEECH, pp. 1295–1298. ISCA (2009)
21. Sherif, T., Kondrak, G.: Substring-based transliteration. In: Carroll, J.A., van den Bosch, A., Zaenen, A. (eds.) ACL. Association for Computational Linguistics, Morristown (2007)
22. Smith, D.A., Rydberg-Cox, J.A., Crane, G.R.: The Perseus project: a digital library for the humanities. Literary and Linguistic Computing 15(1), 15–25 (2000). http://llc.oxfordjournals.org/content/15/1/15
23. Toutanova, K., Cherry, C.: A global model for joint lemmatization and part-of-speech prediction. In: Su, K.Y., Su, J., Wiebe, J. (eds.) ACL/IJCNLP, pp. 486–494. Association for Computational Linguistics, Morristown (2009)

Author information

Authors and Affiliations

Text Technology Lab, Goethe University, Frankfurt am Main, Germany
Steffen Eger

Corresponding author

Correspondence to Steffen Eger.

Editor information

Editors and Affiliations

Institut für Deutsche Sprache, Mannheim, Germany
Cerstin Mahlow
Leibniz Institute of European History, Mainz, Germany
Michael Piotrowski

Cite this paper

Eger, S. (2015). Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2015. Communications in Computer and Information Science, vol 537. Springer, Cham. https://doi.org/10.1007/978-3-319-23980-4_2

DOI: https://doi.org/10.1007/978-3-319-23980-4_2
Published: 09 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23978-1
Online ISBN: 978-3-319-23980-4
eBook Packages: Computer Science (R0)