Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes

Fares, Murhaf; Oepen, Stephan; Zhang, Yi

doi:10.1007/978-3-642-37247-6_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this work, we investigate the use of sequence labeling techniques for tokenization, arguably the most foundational task in NLP, which has been traditionally approached through heuristic finite-state rules. Observing variation in tokenization conventions across corpora and processing tasks, we train and test multiple CRF binary sequence labelers and obtain substantial reductions in tokenization error rate over off-the-shelf standard tools. From a domain adaptation perspective, we experimentally determine the effects of training on mixed gold-standard data sets and make a tentative recommendation for practical usage. Furthermore, we present a perspective on this work as a feedback mechanism to resource creation, i.e. error detection in annotated corpora. To investigate the limits of our approach, we study an interpretation of the tokenization problem that shows stark contrasts to ‘classic’ schemes, presenting many more token-level ambiguities to the sequence labeler (reflecting use of punctuation and multi-word lexical units). In this setup, we also look at partial disambiguation by presenting a token lattice to downstream processing.
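The abstract does not spell out the label encoding, so the following is only a minimal sketch of how tokenization can be cast as binary sequence labeling, assuming one label per character: 1 if a gold token ends on that character, 0 otherwise. The function names `boundary_labels` and `apply_labels` are hypothetical and not taken from the paper; a trained labeler such as a CRF would predict the label sequence, which is then decoded back into tokens.

```python
from typing import List


def boundary_labels(raw: str, tokens: List[str]) -> List[int]:
    """Label each character of `raw` with 1 if a gold token ends on it, else 0.

    Assumption (for illustration only): the gold tokens occur in `raw`
    in order and are non-empty substrings of it.
    """
    labels = [0] * len(raw)
    pos = 0
    for tok in tokens:
        start = raw.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found at or after offset {pos}")
        pos = start + len(tok)
        labels[pos - 1] = 1  # the last character of the token carries the boundary
    return labels


def apply_labels(raw: str, labels: List[int]) -> List[str]:
    """Decode a label sequence: cut `raw` after every character labeled 1."""
    tokens: List[str] = []
    current: List[str] = []
    for ch, lab in zip(raw, labels):
        current.append(ch)
        if lab == 1:
            tok = "".join(current).strip()  # drop whitespace between tokens
            if tok:
                tokens.append(tok)
            current = []
    tail = "".join(current).strip()
    if tail:
        tokens.append(tail)
    return tokens


if __name__ == "__main__":
    raw = "Don't panic, Mr. Smith."
    gold = ["Do", "n't", "panic", ",", "Mr.", "Smith", "."]
    labels = boundary_labels(raw, gold)
    print(labels)
    print(apply_labels(raw, labels))
```

Because whitespace inside a token is preserved during decoding, the same encoding can represent a multi-word unit such as "ad hoc" as a single token, which is the kind of scheme variation the abstract alludes to.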

Author information

Authors and Affiliations

Institutt for Informatikk, Universitetet i Oslo, Norway
Murhaf Fares & Stephan Oepen
LT-Lab, German Research Center for Artificial Intelligence, Germany
Yi Zhang

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Fares, M., Oepen, S., Zhang, Y. (2013). Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_19

DOI: https://doi.org/10.1007/978-3-642-37247-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science (R0)