Abstract
We propose Sumo, a universal formalism for the segmentation of text documents. Its main purpose is to help create segmentation systems for documents in any language. Because the processing is independent of the language, any level of segmentation (character, word, sentence, paragraph, etc.) can be considered. We argue for the usefulness of such a formalism, describe the segmentation framework on which Sumo relies, and give detailed examples that demonstrate some of its features.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Quint, J. (2000). Universal Segmentation of Text with the Sumo Formalism. In: Christodoulakis, D.N. (ed.) Natural Language Processing — NLP 2000. NLP 2000. Lecture Notes in Computer Science, vol 1835. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45154-4_2
DOI: https://doi.org/10.1007/3-540-45154-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67605-8
Online ISBN: 978-3-540-45154-9
eBook Packages: Springer Book Archive