Zone analysis in biology articles as a basis for information extraction

https://doi.org/10.1016/j.ijmedinf.2005.06.013

Summary

In the field of biomedicine, an overwhelming amount of experimental data has become available as a result of the high throughput of research in this domain. The amount of results reported has now grown beyond the limits of what can be managed by manual means. This makes it increasingly difficult for the researchers in this area to keep up with the latest developments. Information extraction (IE) in the biological domain aims to provide an effective automatic means to dynamically manage the information contained in archived journal articles and abstract collections and thus help researchers in their work. However, while considerable advances have been made in certain areas of IE, pinpointing and organizing factual information (such as experimental results) remains a challenge. In this paper we propose tackling this task by incorporating into IE information about rhetorical zones, i.e. classification of spans of text in terms of argumentation and intellectual attribution. As the first step towards this goal, we introduce a scheme for annotating biological texts for rhetorical zones and provide a qualitative and quantitative analysis of the data annotated according to this scheme. We also discuss our preliminary research on automatic zone analysis, and its incorporation into our IE framework.

Introduction

Information extraction (IE) in the biological domain is now regarded as an essential technique for utilizing information contained in archived journal articles and abstract collections such as MEDLINE. Major domain databases offer large-scale archives of semi-structured results, but they tend to be incomplete or out of date. For the latest results, as well as for confirmation of results reported in the database and for supplementary information, access to the latest literature is necessary. There is thus a clear need for more systematic management of facts, specifically the integration and updating of experimental results. Given the limitations of manual work for such purposes in terms of both efficiency and accuracy, IE's focus on factual information is of critical importance.

Recent intensive research in natural language processing in the biological domain (bioNLP) has made major progress in the extraction of bio-named entities and biological interactions (e.g. [1], [2], [3]), but further advancement aimed at pinpointing and organizing factual information remains a challenge. In particular, the important task of identifying new experimental results is complicated by the large number of statements made in each article that pertain to results in general, including references to previous work as well as technical details and conjectures. The same information (about a molecular event, for example) may be provided as a new result, as a previously known result, as a conjecture, etc., that is, in different rhetorical contexts. Because current IE relies on surface lexical and syntactic patterns, it is not sensitive to the rhetorical status of information. As a consequence, existing techniques tend to extract old results mixed with new ones, leaving the novel contribution unclear. We expect that a rhetorical analysis of biological texts and its incorporation into IE will provide a means to address this problem.
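To make the problem concrete, the following is a minimal sketch, not the authors' implementation, of how a rhetorical label attached to each extracted statement would let new results be separated from the rest. The class `ExtractedFact`, the function `group_by_zone`, the label strings (`new_result`, `previous_result`, `conjecture`) and the example sentences are all hypothetical, chosen only to mirror the rhetorical contexts named above; they are not the zone classes of the annotation scheme.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    """One statement pulled from an article by a pattern-based IE step."""
    text: str
    zone: str  # hypothetical label for the rhetorical context of the statement


def group_by_zone(facts):
    """Group extracted statements by their rhetorical zone label."""
    grouped = defaultdict(list)
    for fact in facts:
        grouped[fact.zone].append(fact.text)
    return dict(grouped)


# Invented example statements; none of them come from the annotated corpus.
facts = [
    ExtractedFact("Protein A binds protein B.", "new_result"),
    ExtractedFact("Protein A was shown to bind protein C.", "previous_result"),
    ExtractedFact("Protein A may regulate gene D.", "conjecture"),
    ExtractedFact("Protein B activates gene E.", "new_result"),
]

by_zone = group_by_zone(facts)

# Without the zone label all four statements look alike to a surface-pattern
# extractor; with it, the novel contribution can be selected directly.
new_results = by_zone.get("new_result", [])
print(new_results)
```

In this sketch the zone label is assumed to be given. Assigning it, by hand or automatically, is the zone analysis task that the rest of the paper addresses.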

As the first step towards this goal, we proposed in [4] annotating biology texts in terms of rhetorical zones with a shallow nesting using a scheme modified from [5]. In [6], we gave a qualitative analysis of our sample hand-annotated data and described how such zones can be identified by humans. In this paper, we provide a comprehensive qualitative and quantitative analysis of the process and the results of zone analysis (ZA) in the hand-annotated data, describe some preliminary work on automatic ZA and discuss its future incorporation into our framework for IE.

The organization of the paper is as follows. We first discuss in more detail the motivations for ZA in biology, from an IE perspective and in view of previous work (Section 2). We then introduce our framework. We describe general characteristics of biological articles and introduce our annotation scheme and annotated data (Section 3). We then shed light on the decision process by which zones can be identified by a human annotator and describe, mostly qualitatively, the main features required for identification of each zone class (Section 4). To summarize the results of the annotation work, we illustrate the distribution of zones in the annotated articles both visually and quantitatively (Section 5). Finally, we discuss ongoing and future work on automatic ZA and its integration into IE (Section 6).

Section snippets

Critical issues in bioNLP

We discuss below the critical issues in bioNLP involved in pinpointing and organizing factual information and describe how ZA can be applied to help this process.

First, we argue that the current IE techniques do not effectively allow us to distinguish between different kinds of factual information, although input data contains material sufficient for this purpose. In particular, biological articles include useful information about various rhetorical statuses (classification of text in terms of

Framework

For the reasons mentioned above, we have made some major modifications to the original scheme of [5]. These are for conceptual clarification and for a closer look at the author's own work focusing on the experimental results. In what follows, we first describe the general characteristics of biological articles and then introduce our annotation scheme in terms of zone classes and the principles of annotation.

Main features of each zone

This section describes the feature types and the decision process used by the human annotator and the main characteristics of each zone based on our annotation task and data.

Distribution of zones

Quantitative data on the distribution of zones in our 20 annotated articles is also of interest, both as a general description of the rhetorical structure of biological articles and as a basis for interpreting the results of automatic ZA and IE in the future. In the subsequent sections we first provide a visual illustration of the patterns in which the zones tend to appear (Section 5.1) and then show statistical data illustrating the distribution, nature and location of

Ongoing and future work

In this paper, we have described our zone annotation scheme for organizing experimental results in biological texts and provided a qualitative and quantitative analysis of the process and the results of ZA based on our hand-annotated sample of 20 articles. This is the first major step towards what is our ultimate goal: to use automatic ZA for the purpose of IE from biological data.

Using our annotated data as training material, we have recently started machine learning experiments for automatic

Conclusions

In this paper, we have focused on the problem that current IE techniques do not provide sufficient or effective means for managing the factual data (particularly data pertaining to experimental results) in the rapidly growing field of biology. We have proposed addressing this problem by means of rhetorical zone analysis. As the first step towards this goal, we have introduced an annotation scheme for biological texts, provided data annotated according to this scheme, and given a comprehensive

Acknowledgements

We gratefully acknowledge the helpful comments from the anonymous reviewers of earlier versions of the paper, and from Patrick Ruth, Udo Hahn and others in the audience of the JNLPBA Workshop held in conjunction with the COLING 2004 conference. We also thank Simone Teufel (University of Cambridge) and Noriko Kando (NII) for stimulating discussions. We are also grateful for the generous support of Prof. Asao Fujiyama (NII) and the partial financial support from the BioPortal Project performed through

References (16)

  • E. Liddy, The discourse-level structure of empirical abstracts: an explanatory study, Inform. Process. Manage. (1991)
  • M. Craven et al., Constructing biological knowledge bases by extracting information from text sources
  • K. Humphreys et al., Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures
  • L. Tanabe et al., Tagging gene and protein names in biomedical text, Bioinformatics (2002)
  • Y. Mizuta et al., An annotation scheme for a rhetorical analysis of biology articles
  • S. Teufel et al., Summarizing scientific articles: experiments with relevance and rhetorical status, Computational Linguistics (2002)
  • Y. Mizuta et al., Zone identification in biology articles as a basis for information extraction
  • J. Swales, Genre Analysis (1990)
There are more references available in the full text version of this article.
