Simulating Morphological Analyzers with Stochastic Taggers for Confidence Estimation

Monson, Christian; Hollingshead, Kristy; Roark, Brian

doi:10.1007/978-3-642-15754-7_78

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6241)

Included in the following conference series:

Workshop of the Cross-Language Evaluation Forum for European Languages

Abstract

We propose a method for providing stochastic confidence estimates for rule-based and black-box natural language (NL) processing systems. Our method does not require labeled training data: we simply train stochastic models on the output of the original NL systems. Numeric confidence estimates enable both minimum Bayes risk–style optimization and principled system combination for these knowledge-based and black-box systems. In our specific experiments, we enrich ParaMor, a rule-based system for unsupervised morphology induction, with probabilistic segmentation confidences by training a statistical natural language tagger to simulate ParaMor’s morphological segmentations. By adjusting the numeric threshold above which the simulator proposes morpheme boundaries, we improve F₁ of morpheme identification on a Hungarian corpus by 5.9% absolute. With numeric confidences in hand, we also combine ParaMor’s segmentation decisions with those of a second (black-box) unsupervised morphology induction system, Morfessor. Our joint ParaMor-Morfessor system improves F₁ by a further 3.4% absolute, ultimately moving F₁ from 41.4% to 50.7%.
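
The thresholding and combination steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `segment` and `combine`, the per-position probability list, and the linear-interpolation rule used to merge the two systems are all assumptions made for illustration, since the abstract does not specify how the confidences are combined.

```python
from typing import List, Sequence


def segment(word: str, boundary_probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Split `word` after character i whenever boundary_probs[i] > threshold.

    `boundary_probs` is a hypothetical per-position confidence, one value for
    each of the len(word) - 1 gaps between adjacent characters.
    """
    if not word or len(boundary_probs) != len(word) - 1:
        raise ValueError("expected one probability per gap between characters")
    morphs, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            morphs.append(word[start:i + 1])
            start = i + 1
    morphs.append(word[start:])
    return morphs


def combine(sim_probs: Sequence[float], other_boundaries: Sequence[bool],
            weight: float = 0.5) -> List[float]:
    """Assumed combination rule: interpolate the simulator's confidences with
    the hard (0/1) boundary decisions of a second, black-box system."""
    if len(sim_probs) != len(other_boundaries):
        raise ValueError("both systems must score the same positions")
    return [weight * p + (1.0 - weight) * (1.0 if b else 0.0)
            for p, b in zip(sim_probs, other_boundaries)]


if __name__ == "__main__":
    probs = [0.05, 0.1, 0.2, 0.45, 0.02]          # made-up confidences for "walked"
    print(segment("walked", probs, 0.5))           # ['walked']
    print(segment("walked", probs, 0.4))           # ['walk', 'ed']
    second = [False, False, False, True, False]    # made-up second-system output
    print(segment("walked", combine(probs, second), 0.5))  # ['walk', 'ed']
```

Lowering the threshold trades precision for recall, which is the knob the abstract refers to when it reports the F₁ gain from threshold adjustment; the probabilities and the second system's decisions above are invented example values.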

Author information

Authors and Affiliations

Center for Spoken Language Understanding, Oregon Health & Science University, USA
Christian Monson, Kristy Hollingshead & Brian Roark

Editor information

Editors and Affiliations

ISTI-CNR, Area Ricerca CNR, Via Moruzzi, 1, 56124, Pisa, Italy
Carol Peters
Department of Information Engineering, University of Padua, Via Gradenigo, 6/a, 35131, Padova, Italy
Giorgio Maria Di Nunzio
Aalto University, P.O. Box 15400, 00076, Aalto, Finland
Mikko Kurimo
University of Hildesheim, 31141, Hildesheim, Germany
Thomas Mandl
ELDA/ELRA, 75013, Paris, France
Djamel Mostefa
LSI-UNED, 28040, Madrid, Spain
Anselmo Peñas
Matrixware, 1060, Vienna, Austria
Giovanna Roda

About this paper

Cite this paper

Monson, C., Hollingshead, K., Roark, B. (2010). Simulating Morphological Analyzers with Stochastic Taggers for Confidence Estimation. In: Peters, C., et al. Multilingual Information Access Evaluation I. Text Retrieval Experiments. CLEF 2009. Lecture Notes in Computer Science, vol 6241. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15754-7_78

DOI: https://doi.org/10.1007/978-3-642-15754-7_78
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15753-0
Online ISBN: 978-3-642-15754-7
eBook Packages: Computer Science, Computer Science (R0)
