Abstract
Identifying sentence boundaries is an indispensable task for most natural language processing (NLP) systems. Although extensive effort has been devoted to mining biomedical text with NLP techniques, few attempts specifically target sentence boundary disambiguation in biomedical literature, which has a number of unique features that can significantly reduce the accuracy of algorithms designed for general English text. To increase the accuracy of sentence boundary identification in biomedical literature, we developed a method that combines heuristic and statistical strategies. Our approach requires neither part-of-speech taggers nor training procedures. Experiments on biomedical test corpora show that our system significantly outperforms existing sentence boundary determination algorithms, particularly on full-text biomedical literature. Our system is very fast, and it should be easily adaptable to sentence boundary determination in scientific literature from non-biomedical fields.
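The abstract does not spell out the rules or statistics the system uses, so the following is only an illustrative sketch of what combining a heuristic with a simple corpus statistic could look like, not the authors' algorithm. Everything in it is an assumption made for illustration: the abbreviation list, the candidate-boundary pattern, the `min_count` threshold, and the function names. The heuristic part skips periods that end a listed abbreviation; the statistical part skips periods on tokens that the same text elsewhere shows followed by a lowercase word, which suggests an abbreviation. Like the approach described above, it needs no part-of-speech tagger and no training step.

```python
import re

# Hypothetical abbreviation list for illustration; not taken from the paper.
KNOWN_ABBREVIATIONS = {"fig.", "figs.", "al.", "e.g.", "i.e.", "vs.", "dr.", "no.", "ref."}

# Candidate boundary: terminal punctuation followed by whitespace and then
# an uppercase letter or a digit.
CANDIDATE = re.compile(r"[.?!](?=\s+[A-Z0-9])")


def normalize(token):
    """Lowercase a token and drop leading brackets or quotes."""
    return token.lower().lstrip("([\"'")


def lowercase_follow_counts(text):
    """Count how often each period-final token precedes a lowercase word."""
    counts = {}
    for match in re.finditer(r"(\S+\.)\s+[a-z]", text):
        token = normalize(match.group(1))
        counts[token] = counts.get(token, 0) + 1
    return counts


def split_sentences(text, min_count=1):
    """Split text into sentences using the two checks described above."""
    counts = lowercase_follow_counts(text)
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        # Last whitespace-delimited token before the candidate boundary.
        token = normalize(text[start:end].split()[-1])
        # Heuristic check: the period belongs to a listed abbreviation.
        if token in KNOWN_ABBREVIATIONS:
            continue
        # Statistical check: the same token precedes a lowercase word
        # elsewhere in the text, so it is probably an abbreviation.
        if match.group() == "." and counts.get(token, 0) >= min_count:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = (
        "The protein binds DNA (Fig. 2A). Smith et al. reported similar data. "
        "Expression of approx. 5 genes rose; approx. half were novel. "
        "Was p53 involved?"
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

In the sample, "Fig." is handled by the list, while "approx." is not listed and is kept unsplit only because the text also contains "approx. half". A real system would need far more careful rules for the features of biomedical text mentioned above.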
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xuan, W., Watson, S.J., Meng, F. (2007). Tagging Sentence Boundaries in Biomedical Literature. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_17
DOI: https://doi.org/10.1007/978-3-540-70939-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)