Parsing Biomedical Literature

Lease, Matthew; Charniak, Eugene

doi:10.1007/11562214_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB-training: part-of-speech tags, dictionary collocations, and named-entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically-adapted parser achieves a 14.2% reduction in error. With oracle-knowledge of named-entities, this error reduction improves to 21.2%.
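The two quantities the abstract relies on, lexical coverage and relative error reduction, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the toy token lists, and the scores 0.80 and 0.85 are hypothetical, and taking error as 1 minus a bracketing score is an assumption rather than something the abstract states.

```python
def unknown_word_rate(train_tokens, target_tokens):
    """Fraction of target-domain tokens whose word form never occurs in training."""
    vocab = set(train_tokens)
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)


def relative_error_reduction(baseline_score, adapted_score):
    """Relative reduction in error, assuming error = 1 - score (scores in [0, 1])."""
    baseline_error = 1.0 - baseline_score
    adapted_error = 1.0 - adapted_score
    return (baseline_error - adapted_error) / baseline_error


# Toy data (hypothetical): a newswire-like training sample and a biomedical sentence.
train = "the company said the shares rose".split()
target = "the protein activates the kinase".split()

print(unknown_word_rate(train, target))        # 3 of 5 target tokens are unseen
print(relative_error_reduction(0.80, 0.85))    # hypothetical scores, not the paper's
```

A high unknown-word rate on the target genre is the kind of measurement that would indicate the lexical impoverishment the abstract describes; the second function shows how a gain in accuracy translates into a percentage reduction in error.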

We would like to thank the National Science Foundation for their support of this work (IIS-0112432, LIS-9721276, and DMS-0074276), and Sharon Goldwater and our anonymous reviewers for their valuable feedback.

References

1. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: Genia corpus - a semantically annotated corpus for bio-textmining. Bioinformatics (Supplement: Eleventh International Conference on Intelligent Systems for Molecular Biology) 19, i180–i182 (2003)
2. Tateisi, Y., Ohta, T., Kim, J.D., Hong, H., Jian, S., Tsujii, J.: The Genia corpus: Medline abstracts annotated with linguistic information. In: Third meeting of SIG on Text Mining, Intelligent Systems for Molecular Biology, ISMB (2003)
3. Charniak, E.: A maximum-entropy-inspired parser. In: Proc. NAACL, pp. 132–139 (2000)
4. Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19, 313–330 (1993)
5. Collins, M.: Discriminative reranking for natural language parsing. In: Proc. ICML, pp. 175–182 (2000)
6. Ratnaparkhi, A.: Learning to parse natural language with maximum entropy models. Machine Learning 34, 151–175 (1999)
7. Gildea, D.: Corpus variation and parser performance. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 167–202 (2001)
8. Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel domains. In: Proceedings of HLT-NAACL, pp. 205–212 (2003)
9. Steedman, M., Hwa, R., Clark, S., Osborne, M., Sarkar, A., Hockenmaier, J., Ruhlen, P., Baker, S., Crim, J.: Example selection for bootstrapping statistical parsers. In: Proceedings of HLT-NAACL, pp. 331–338 (2003)
10. de Bruijn, B., Martin, J.: Literature mining in molecular biology. In: Proceedings of the European Federation for Medical Informatics (EFMI) Workshop on Natural Language Processing in Biomedical Applications (2002)
11. Hirschman, L., Park, J., Tsujii, J., Wong, L., Wu, C.: Accomplishments and challenges in literature data mining for biology. Bioinformatics 18, 1553–1561 (2002)
12. Yakushiji, A., Tateisi, Y., Miyao, Y., Tsujii, J.: Event extraction from biomedical papers using a full parser. In: Pacific Symposium on Biocomputing, pp. 408–419 (2001)
13. Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., Mazo, I.: Extracting human protein interactions from Medline using a full-sentence parser. Bioinformatics 20, 604–611 (2004)
14. Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)
15. Hwa, R.: Learning Probabilistic Lexicalized Grammars for Natural Language Processing. PhD thesis, Harvard University (2001)
16. Bies, A., Ferguson, M., Katz, K., MacIntyre, R.: Bracketing Guidelines for Treebank II style Penn Treebank Project. Linguistic Data Consortium (1995)
17. Buckley, C.: Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University (1985)
18. Goodman, J.: Parsing inside-out. PhD thesis, Harvard University (1998)
19. McCray, A.T., Srinivasan, S., Browne, A.C.: Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care (SCAMC), pp. 235–239 (1994)
20. Grover, C., Lapata, M., Lascarides, A.: A comparison of parsing technologies for the biomedical domain. Journal of Natural Language Engineering (2002)
21. Surdeanu, M., Harabagiu, S., Williams, J., Aarseth, P.: Using predicate-argument structures for information extraction. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), pp. 8–15 (2003)
22. Miyao, Y., Ninomiya, T., Tsujii, J.: Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In: Proc. of IJCNLP-2004, pp. 684–693 (2004)
23. Zhou, G., Su, J.: Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications, JNLPBA-2004 (2004)
24. Charniak, E.: Statistical parsing with a context-free grammar and word statistics. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park (1997)
25. Park, J.C.: Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems 16, 62–67 (2001)

Author information

Authors and Affiliations

Brown Laboratory for Linguistic Information Processing (BLLIP), Brown University, Providence, RI, USA
Matthew Lease & Eugene Charniak

Editor information

Editors and Affiliations

Center for Language Technology, Macquarie University, 2019, Sydney, NSW, Australia
Robert Dale
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
Institute for Infocomm Research, 21, Heng Mui Keng Terrace, 119613, Singapore
Jian Su
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

About this paper

Cite this paper

Lease, M., Charniak, E. (2005). Parsing Biomedical Literature. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_6

DOI: https://doi.org/10.1007/11562214_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)