
The importance of the lexicon in tagging biological text

Published online by Cambridge University Press:  14 December 2005

LAWRENCE H. SMITH
Affiliation:
National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA e-mail: lsmith@ncbi.nlm.nih.gov, wilbur@ncbi.nlm.nih.gov
THOMAS C. RINDFLESCH
Affiliation:
Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD, USA e-mail: tcr@nlm.nih.gov
W. JOHN WILBUR
Affiliation:
National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA e-mail: lsmith@ncbi.nlm.nih.gov, wilbur@ncbi.nlm.nih.gov

Abstract

A part-of-speech tagger is a fundamental and indispensable tool in computational linguistics, typically employed at the critical early stages of processing. Although widely available taggers achieve high accuracy in very general domains, they do not perform nearly as well when applied to novel specialized domains, and this is especially true of biological text. We present a stochastic tagger that achieves over 97.44% accuracy on MEDLINE abstracts. A primary component of the tagger is its lexicon, which enumerates the permitted parts of speech for the 10,000 words most frequently occurring in MEDLINE. We present evidence for the conclusion that the lexicon is as vital to tagger accuracy as a training corpus, and more important than previously thought.
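
To illustrate what it means for a lexicon to enumerate the permitted parts of speech for a word, the following is a minimal sketch and not the authors' implementation. The lexicon entries, the Penn Treebank-style tag names, the fallback tag set for unknown words, and the function names are all assumptions made for this example. The point is only that a lexicon narrows the tags a stochastic model may choose from for each known word.

```python
from typing import Dict, FrozenSet

# Hypothetical toy lexicon: each known word maps to its permitted tags.
# Tag names are illustrative and may differ from the tagger's own tag set.
LEXICON: Dict[str, FrozenSet[str]] = {
    "the": frozenset({"DT"}),
    "binding": frozenset({"NN", "VBG", "JJ"}),
    "protein": frozenset({"NN"}),
    "binds": frozenset({"VBZ"}),
}

# Assumed fallback: tags considered for words absent from the lexicon.
OPEN_CLASS_TAGS: FrozenSet[str] = frozenset({"NN", "JJ", "VB"})


def candidate_tags(word: str) -> FrozenSet[str]:
    """Return the permitted tags for a word, or the fallback set if unknown."""
    return LEXICON.get(word.lower(), OPEN_CLASS_TAGS)


def restrict(scores: Dict[str, float], word: str) -> Dict[str, float]:
    """Keep only the model scores whose tags the lexicon permits for this word."""
    allowed = candidate_tags(word)
    return {tag: score for tag, score in scores.items() if tag in allowed}
```

In this sketch, a model that assigns some probability to an impossible tag for "binds" would have that option removed before a tag is chosen, which is one simple way a lexicon can constrain a stochastic tagger.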

Type
Papers
Copyright
2005 Cambridge University Press
