Abstract
We ran both Brill’s rule-based tagger and TnT, a statistical tagger, with a default German newspaper-language model on a medical text corpus. Supplied with limited lexicon resources, TnT outperforms the Brill tagger and reaches state-of-the-art performance (close to 97% accuracy). We then trained TnT on a large annotated medical text corpus, with a slightly extended tagset that captures certain particularities of medical language, and achieved 98% tagging accuracy. Hence, statistical off-the-shelf POS taggers can not only be reused immediately for medical NLP, but they also achieve, when trained on medical corpora, a higher performance level than for the newspaper genre.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hahn, U., Wermter, J. (2004). Tagging Medical Documents with High Accuracy. In: Zhang, C., Guesgen, H.W., Yeap, W.K. (eds) PRICAI 2004: Trends in Artificial Intelligence. PRICAI 2004. Lecture Notes in Computer Science, vol 3157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28633-2_90
DOI: https://doi.org/10.1007/978-3-540-28633-2_90
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22817-2
Online ISBN: 978-3-540-28633-2
eBook Packages: Springer Book Archive