Abstract
In this paper we present a study comparing the performance of several systems from the literature that automatically segment English documents into sentences. We also describe the difficulties encountered in adapting these systems to Portuguese documents and report the results obtained after the adaptation. We analyzed two systems that use a machine learning approach, MxTerminator and Satz, and a customized system based on fixed rules expressed as regular expressions. The results achieved by the Satz system were surprisingly positive for Portuguese documents.
This research was supported by the Brazilian PIBIC-CNPq Agency.
References
Reynar, J., Ratnaparkhi, A.: A maximum entropy approach to identifying sentence boundaries. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 16–19 (1997)
Palmer, D.D., Hearst, M.A.: Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23, 241–267 (1997)
Silla Jr., C.N., Valle Jr., J.D., Kaestner, C.A.A.: Automatic sentence detection using regular expressions (in Portuguese). In: Proceedings of the 3rd Brazilian Computer Science Congress, Itajaí, SC, Brazil, pp. 548–560 (2003)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Palmer, D.D.: SATZ - an adaptive sentence segmentation system. Master’s thesis (1994)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Wiley-Interscience, San Francisco (1999)
Aluisio, S.M., Pinheiro, G.M., Finger, Nunes, M.G.V., Tagnin, S.E.: The Lacio-Web project: overview and issues in Brazilian Portuguese corpora creation. In: Proceedings of the Corpus Linguistics 2003, vol. 16, pp. 14–21 (2003)
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Silla, C.N., Kaestner, C.A.A. (2004). An Analysis of Sentence Boundary Detection Systems for English and Portuguese Documents. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_16
DOI: https://doi.org/10.1007/978-3-540-24630-5_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive