Abstract
We consider the statistical lemmatization problem in which lemmatizers are trained on (word form, lemma) pairs. In particular, we consider this problem for ancient Latin, a language with high degree of morphological variability. We investigate whether general purpose string-to-string transduction models are suitable for this task, and find that they typically perform (much) better than more restricted lemmatization techniques/heuristics based on suffix transformations. We also experimentally test whether string transduction systems that perform well on one string-to-string translation task (here, G2P) perform well on another (here, lemmatization) and vice versa, and find that a joint n-gram modeling performs better on G2P than a discriminative model of our own making but that this relationship is reversed for lemmatization. Finally, we investigate how the learned lemmatizers can complement lexicon-based systems, e.g., by tackling the OOV and/or the disambiguation problem.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
In stemming, all that typically matters is that related words map to the same (linguistic or even non-linguistic) object.
- 2.
See, e.g., https://prepro.hucompute.org/.
- 3.
In our experiments below, we choose an n-gram order of size 6 for Phonetisaurus. Increasing n-gram order size did not lead to better performance in preliminary tests.
- 4.
We use the alignments produced by the Phonetisaurus toolkit.
- 5.
Although CRFs are rather old and typically not always the best-performing sequence labeling models [17], we use them here mainly for practical reasons. In particular, the CRF package we are using, available from https://code.google.com/p/crfpp/, provides a very convenient interface to modeling sequence labeling.
- 6.
Increasing window size typically does not lead to better performance, as we verified in preliminary experiments.
- 7.
Typically, word forms in other word classes are also not inflectional, so that the learning problem would be trivial.
- 8.
In fact, it seems that Mate simply stores input strings that occur fewer than 5 times, rather than learning substitution patterns from these (personal communication with Bernd Bohnet). Thus, the evaluation scenario adopted in this work puts Mate at a general disadvantage, since we generally train systems on arbitrary lists of word pairs selected from a lexicon rather than on the distributions found in ‘real’ text.
- 9.
E.g., when the lemmatizer is developed to assist a lexicon-based lemmatizer.
- 10.
We also performed the alternative decoding strategy where lemmatizers are separately trained, but found it to perform worse.
- 11.
Available at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
- 12.
We could not use Perseus because the TreeTagger was trained on Perseus.
References
Bartlett, S., Kondrak, G., Cherry, C.: Automatic syllabification with structured SVMs for letter-to-phoneme conversion. In: McKeown, K., Moore, J.D., Teufel, S., Allan, J., Furui, S. (eds.) ACL, pp. 568–576. Association for Computational Linguistics, Morristown (2008)
Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)
Bohnet, B.: Top accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Coling 2010 Organizing Committee, Beijing, China, pp. 89–97, August 2010. http://www.aclweb.org/anthology/C10-1011
Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL 2000, pp. 286–293. Association for Computational Linguistics, Stroudsburg (2000)
Daelemans, W., Groenewald, H.J., Huyssteen, G.B.V.: Prototype-based active learning for lemmatization. In: Angelova, G., Bontcheva, K., Mitkov, R., Nicolov, N., Nikolov, N. (eds.) RANLP, pp. 65–70. RANLP 2009 Organising Committee/ACL, Morristown (2009)
Dreyer, M., Smith, J., Eisner, J.: Latent-variable modeling of string transductions with finite-state methods. In: EMNLP, pp. 1080–1089. ACL (2008)
Eger, S.: Sequence segmentation by enumeration: an exploration. Prague Bull. Math. Linguist. 100, 113–132 (2013)
Eger, S., vor der Brück, T., Mehler, A.: Lexicon-assisted tagging and lemmatization in Latin: a comparison of six taggers and two lemmatization methods. In: Latech 2015. Association for Computational Linguistics (2015, accepted)
Gesmundo, A., Samardzic, T.: Lemmatisation as a tagging task. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (vol. 2: Short Papers), pp. 368–372. Association for Computational Linguistics (2012). http://aclweb.org/anthology/P12-2072
Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT, pp. 905–913. Association for Computational Linguistics, Columbus, June 2008. http://www.aclweb.org/anthology/P/P08/P08-1103
Jiampojamarn, S., Cherry, C., Kondrak, G.: Integrating joint n-gram features into a discriminative training framework. In: NAACL-HLT, pp. 697–700. Association for Computational Linguistics (2010)
Juršič, M., Mozetič, I., Lavrač, N.: Learning ripple down rules for efficient lemmatization. In: Mladenić, D., Grobelnik, M. (eds.) Proceedings of the 10th International Multiconference Information Society, pp. 206–209. IJS, Ljubljana (2007)
Juršič, M., Mozetič, I., Lavrač, N.: LemmaGen: multilingual lemmatisation with induced ripple-down rules. J. Univ. Comput. Sci. 16, 1190–1214 (2010)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001)
Mehler, A., vor der Brück, T., Gleim, R., Geelhaar, T.: Towards a network model of the coreness of texts: an experiment in classifying Latin texts using the ttlab Latin tagger. In: Biemann, C., Mehler, A. (eds.) Text Mining: From Ontology Learning to Automated text Processing Applications. Theory and Applications of Natural Language Processing, pp. 87–112. Springer, Berlin (2015)
Migne, J.P. (ed.): Patrologiae Cursus Completus: Series Latina, vol. 1–221. Chadwyck-Healey, Cambridge (1844–1855)
Nguyen, N., Guo, Y.: Comparisons of sequence labeling algorithms and extensions. In: Ghahramani, Z. (ed.) ICML. ACM International Conference Proceeding Series, vol. 227, pp. 681–688. ACM, New York (2007)
Novak, J.R., Minematsu, N., Hirose, K.: WFST-based grapheme-to-phoneme conversion: open source tools for alignment, model-building and decoding. In: Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pp. 45–49. Association for Computational Linguistics, Donostia-San Sebasti, July 2012. http://www.aclweb.org/anthology/W12-6208
Porter, M.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst. 14(3), 130–137 (1980)
Richmond, K., Clark, R.A.J., Fitt, S.: Robust LTS rules with the Combilex speech technology lexicon. In: INTERSPEECH, pp. 1295–1298. ISCA (2009)
Sherif, T., Kondrak, G.: Substring-based transliteration. In: Carroll, J.A., van den Bosch, A., Zaenen, A. (eds.) ACL. Association for Computational Linguistics, Morristown (2007)
Smith, D.A., Rydberg-Cox, J.A., Crane, G.R.: The Perseus project: a digital library for the humanities. Literary and Linguistic Computing 15(1), 15–25 (2000). http://llc.oxfordjournals.org/content/15/1/15
Toutanova, K., Cherry, C.: A global model for joint lemmatization and part-of-speech prediction. In: Su, K.Y., Su, J., Wiebe, J. (eds.) ACL/IJCNLP, pp. 486–494. Association for Computational Linguistics, Morristown (2009)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Eger, S. (2015). Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2015. Communications in Computer and Information Science, vol 537. Springer, Cham. https://doi.org/10.1007/978-3-319-23980-4_2
Download citation
DOI: https://doi.org/10.1007/978-3-319-23980-4_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23978-1
Online ISBN: 978-3-319-23980-4
eBook Packages: Computer ScienceComputer Science (R0)