Abstract
In this work, we investigate the use of sequence labeling techniques for tokenization, arguably the most foundational task in NLP, which has traditionally been approached through heuristic finite-state rules. Observing variation in tokenization conventions across corpora and processing tasks, we train and test multiple CRF binary sequence labelers and obtain substantial reductions in tokenization error rate over standard off-the-shelf tools. From a domain adaptation perspective, we experimentally determine the effects of training on mixed gold-standard data sets and make a tentative recommendation for practical usage. Furthermore, we present a perspective on this work as a feedback mechanism for resource creation, i.e., error detection in annotated corpora. To investigate the limits of our approach, we study an interpretation of the tokenization problem that contrasts starkly with ‘classic’ schemes, presenting many more token-level ambiguities to the sequence labeler (reflecting the use of punctuation and multi-word lexical units). In this setup, we also look at partial disambiguation by presenting a token lattice to downstream processing.
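To make the framing concrete, the following is a minimal sketch of one way tokenization can be cast as binary sequence labeling over characters. It is an illustration only, not the authors' implementation: the label scheme (1 marks the last character of a token, 0 everything else), the function names `tokens_to_labels` and `labels_to_tokens`, and the example sentence are all assumptions, and the CRF that would predict the labels is omitted.

```python
def tokens_to_labels(text, tokens):
    """Turn a gold tokenization into one binary label per character.

    Assumed scheme: label 1 on the final character of each token, 0 elsewhere.
    Tokens must occur in `text` in order; tokens may contain whitespace.
    """
    labels = [0] * len(text)
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        pos = start + len(tok)
        labels[pos - 1] = 1  # token ends at this character
    return labels


def labels_to_tokens(text, labels):
    """Recover tokens from per-character labels (inverse of the above)."""
    tokens, start = [], 0
    for i, label in enumerate(labels):
        if label == 1:
            # Whitespace between tokens is carried along and stripped here.
            tokens.append(text[start:i + 1].strip())
            start = i + 1
    return tokens


text = "Don't panic, Mr. Smith."

# Two hypothetical conventions for the same raw string.
scheme_a = ["Do", "n't", "panic", ",", "Mr.", "Smith", "."]
scheme_b = ["Don't", "panic", ",", "Mr.", "Smith", "."]

labels_a = tokens_to_labels(text, scheme_a)
labels_b = tokens_to_labels(text, scheme_b)
```

Under this encoding, the two conventions differ only in the label of a single character (the `o` in `Don't`), which is the kind of scheme-specific decision a trained labeler would have to learn from gold-standard data.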
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fares, M., Oepen, S., Zhang, Y. (2013). Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_19
DOI: https://doi.org/10.1007/978-3-642-37247-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)