Abstract
We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB training: part-of-speech tags, dictionary collocations, and named entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically adapted parser achieves a 14.2% reduction in error. With oracle knowledge of named entities, this error reduction improves to 21.2%.
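The error reductions quoted above are relative figures. As a minimal sketch of the arithmetic, assuming "error" means one minus an accuracy-style score such as bracket F-measure (the abstract does not specify the metric), the helper below computes the fraction of baseline error removed. The function name and both scores are hypothetical illustrations, not results from the paper.

```python
def relative_error_reduction(baseline_score: float, adapted_score: float) -> float:
    """Return the fraction of the baseline's error removed by the adapted system.

    Both scores are assumed to lie in [0, 1], with error defined as 1 - score.
    """
    baseline_error = 1.0 - baseline_score
    adapted_error = 1.0 - adapted_score
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_error - adapted_error) / baseline_error


# Hypothetical scores chosen only to illustrate the calculation.
baseline = 0.80
adapted = 0.85
reduction = relative_error_reduction(baseline, adapted)
print(f"relative error reduction: {reduction:.1%}")
```

Under this definition, a given relative reduction corresponds to a smaller absolute gain in score when the baseline is already accurate, which is why the two are usually reported separately.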
We would like to thank the National Science Foundation for their support of this work (IIS-0112432, LIS-9721276, and DMS-0074276), and Sharon Goldwater and our anonymous reviewers for their valuable feedback.
References
Kim, J.-D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics (Supplement: Eleventh International Conference on Intelligent Systems for Molecular Biology) 19, i180–i182 (2003)
Tateisi, Y., Ohta, T., Kim, J.-D., Hong, H., Jian, S., Tsujii, J.: The GENIA corpus: MEDLINE abstracts annotated with linguistic information. In: Third meeting of SIG on Text Mining, Intelligent Systems for Molecular Biology, ISMB (2003)
Charniak, E.: A maximum-entropy-inspired parser. In: Proc. NAACL, pp. 132–139 (2000)
Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19, 313–330 (1993)
Collins, M.: Discriminative reranking for natural language parsing. In: Proc. ICML, pp. 175–182 (2000)
Ratnaparkhi, A.: Learning to parse natural language with maximum entropy models. Machine Learning 34, 151–175 (1999)
Gildea, D.: Corpus variation and parser performance. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 167–202 (2001)
Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel domains. In: Proceedings of HLT-NAACL, pp. 205–212 (2003)
Steedman, M., Hwa, R., Clark, S., Osborne, M., Sarkar, A., Hockenmaier, J., Ruhlen, P., Baker, S., Crim, J.: Example selection for bootstrapping statistical parsers. In: Proceedings of HLT-NAACL, pp. 331–338 (2003)
de Bruijn, B., Martin, J.: Literature mining in molecular biology. In: Proceedings of the European Federation for Medical Informatics (EFMI) Workshop on Natural Language Processing in Biomedical Applications (2002)
Hirschman, L., Park, J., Tsujii, J., Wong, L., Wu, C.: Accomplishments and challenges in literature data mining for biology. Bioinformatics 18, 1553–1561 (2002)
Yakushiji, A., Tateisi, Y., Miyao, Y., Tsujii, J.: Event extraction from biomedical papers using a full parser. In: Pacific Symposium on Biocomputing, pp. 408–419 (2001)
Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., Mazo, I.: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20, 604–611 (2004)
Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)
Hwa, R.: Learning Probabilistic Lexicalized Grammars for Natural Language Processing. PhD thesis, Harvard University (2001)
Bies, A., Ferguson, M., Katz, K., MacIntyre, R.: Bracketing Guidelines for Treebank II Style Penn Treebank Project. Linguistic Data Consortium (1995)
Buckley, C.: Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University (1985)
Goodman, J.: Parsing inside-out. PhD thesis, Harvard University (1998)
McCray, A.T., Srinivasan, S., Browne, A.C.: Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care (SCAMC), pp. 235–239 (1994)
Grover, C., Lapata, M., Lascarides, A.: A comparison of parsing technologies for the biomedical domain. Journal of Natural Language Engineering (2002)
Surdeanu, M., Harabagiu, S., Williams, J., Aarseth, P.: Using predicate-argument structures for information extraction. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), pp. 8–15 (2003)
Miyao, Y., Ninomiya, T., Tsujii, J.: Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In: Proc. of IJCNLP-2004, pp. 684–693 (2004)
Zhou, G., Su, J.: Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications, JNLPBA-2004 (2004)
Charniak, E.: Statistical parsing with a context-free grammar and word statistics. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park (1997)
Park, J.C.: Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems 16, 62–67 (2001)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lease, M., Charniak, E. (2005). Parsing Biomedical Literature. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_6
DOI: https://doi.org/10.1007/11562214_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)