Abstract
In this paper, we describe a new unsupervised sentence boundary detection system and compare its performance with that of several systems from the literature that have been used to segment English and Portuguese documents into sentences. The new approach performed as well as the previous systems, a notable result given that it does not require any additional training resources.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Strunk, J., Silla, C.N., Kaestner, C.A.A. (2006). A Comparative Evaluation of a New Unsupervised Sentence Boundary Detection Approach on Documents in English and Portuguese. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_16
DOI: https://doi.org/10.1007/11671299_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science (R0)