A Comparative Evaluation of a New Unsupervised Sentence Boundary Detection Approach on Documents in English and Portuguese

Strunk, Jan; Silla, Carlos N.; Kaestner, Celso A. A.

doi:10.1007/11671299_16

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper, we describe a new unsupervised sentence boundary detection system and present a comparative study that evaluates its performance against systems from the literature that have been used to segment English and Portuguese documents into sentences. The results achieved by the new approach were as good as those of the previous systems, which is notable given that the method does not require any additional training resources.

Author information

Authors and Affiliations

Sprachwissenschaftliches Institut, Ruhr-Universität Bochum, 44780, Bochum, Germany
Jan Strunk
Pontifical Catholic University of Paraná, Rua Imaculada Conceição 1155, 80215-901, Curitiba, Brazil
Carlos N. Silla Jr & Celso A. A. Kaestner

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Strunk, J., Silla, C.N., Kaestner, C.A.A. (2006). A Comparative Evaluation of a New Unsupervised Sentence Boundary Detection Approach on Documents in English and Portuguese. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_16

DOI: https://doi.org/10.1007/11671299_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science, Computer Science (R0)