Universal Segmentation of Text with the Sumo Formalism

Quint, Julien

doi:10.1007/3-540-45154-4_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1835)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

We propose Sumo, a universal formalism for the segmentation of text documents. Its main purpose is to help create segmentation systems for documents in any language. Because the processing is independent of the language, any level of segmentation (character, word, sentence, paragraph, etc.) can be considered. We argue for the usefulness of such a formalism, describe the segmentation framework on which Sumo relies, and give detailed examples that demonstrate some of its features.

Author information

Authors and Affiliations

GETA-CLIPS-IMAG, Université Joseph Fourier, BP 53, 38041, Grenoble Cedex 9, France
Julien Quint
Xerox Research Centre Europe, 6, chemin de Maupertuis, 38240, Meylan, France
Julien Quint

Editor information

Editors and Affiliations

Computer Engineering Department and Computer Technology Institute, University of Patras, 26500, Patras, Greece
Dimitris N. Christodoulakis

About this paper

Cite this paper

Quint, J. (2000). Universal Segmentation of Text with the Sumo Formalism. In: Christodoulakis, D.N. (ed.) Natural Language Processing — NLP 2000. Lecture Notes in Computer Science (LNAI), vol 1835. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45154-4_2

DOI: https://doi.org/10.1007/3-540-45154-4_2
Published: 25 May 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67605-8
Online ISBN: 978-3-540-45154-9
eBook Packages: Springer Book Archive
