A dimensional retrieval model for integrating semantics and statistical evidence in context for genomics literature search

https://doi.org/10.1016/j.compbiomed.2008.11.002

Abstract

We present a dimensional information retrieval model for combining concept-based semantics and term statistics within multiple levels of document context to identify concise, variable-length passages of text that answer a user query. Our results demonstrate improved search results in the presence of varying levels of semantic evidence, and higher performance using retrieval functions that combine document-level information with sentence- and passage-level information. When ranking documents based on the most relevant extracted passages, the results exceed the state-of-the-art by 15.28% as assessed by the TREC 2005 Genomics track collection of 4.5 million MEDLINE citations.

Introduction

Biological research has been transformed by the explosion of scientific literature documenting the results of research facilitated by high-throughput techniques such as gene microarrays and efficient data acquisition methods. As a result, accurate retrieval of information from genomics text has become a key component in experiments for identifying the interplay between genes, proteins, and other biological and disease processes. In addition, the literature itself has become a source of experiments as researchers turn to it to search for knowledge that drives new hypotheses and research [1], [2], [3], [4]. Indeed, the National Institute of Health Roadmap [5] cites integration of research data as a fundamental component for fostering future biomedical research.

Information retrieval in this domain can be challenging due to the wide variation of synonymous terms, acronyms, and morphological variants used for identifying the same biological concepts. For example, the term bovine spongiform encephalopathy represents a specific biological concept that can also be represented by several other terms including: BSE, MCD, Mad Cow Disease, JCD, CJD, and Creutzfeldt-Jakob disease. The term Apolipoprotein E can be represented by the acronym ApoE along with the morphological variants Apo-E and Apo E. In addition, acronyms frequently have multiple meanings (polysemy) and require contextual clues for accurate disambiguation. In MEDLINE citations, the acronym IP is used to represent immunophenotype, intraperitoneally, immunoprecipitation, inositol phosphates, ischemic preconditioning, and inverted papilloma. Similarly, GF is used to represent germ free, grip force, GDNF family, griseofulvin, growth factor, gel filtration, glycolytic flux, growth fraction, granuloma faciale, gingival fibroblasts, and glia filament preparation.
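To make the morphological-variant problem concrete, the following minimal sketch collapses case, hyphens, and whitespace so that ApoE, Apo-E, and Apo E share one lookup key. The function name `normalize_variant` is hypothetical, and this is only an illustration of the problem, not the normalization used by the system described here. It does nothing for true synonyms (BSE versus Mad Cow Disease) or for polysemous acronyms such as IP, which need context.

```python
import re


def normalize_variant(term: str) -> str:
    """Collapse case, hyphens, and whitespace (illustrative assumption only)."""
    return re.sub(r"[\s\-]+", "", term).lower()


# The three spellings from the text map to a single key.
variants = ["ApoE", "Apo-E", "Apo E"]
keys = {normalize_variant(v) for v in variants}
```

A rule this aggressive can also merge terms that should stay distinct, which is one reason surface normalization alone is not sufficient.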

We propose that an effective retrieval model for genomics literature search requires a systematic approach for integrating semantic, contextual, and statistical evidence. Our experimental results validate this approach, demonstrating improved search results in the presence of varying levels of semantic evidence, and higher performance using retrieval functions that combine document as well as sentence and passage level information versus using document, sentence, or passage level information alone.

We first provide a brief overview of genomics information retrieval, followed by the indexing model, the indexing process, our query processing methods, and the resulting retrieval model. This is followed by our experimental methods, results, and a discussion of related work.

Genomics information retrieval

Effective genomics information retrieval requires the ability to identify different biological terms representing the same concept, and context to accurately disambiguate terms and acronyms. For example, context captured within a paragraph or document where an acronym is defined can help disambiguate its meaning. Statistical measures of words co-occurring with ambiguous terms can be used to help disambiguate these concept terms.

Context for identifying relevant search results can be captured at

Dimensional indexing model

Most search indexes are based on a tabular 2-dimensional indexing model relating terms to the documents they appear in. In this model, commonly referred to as the “bag of words” model, we are unable to determine if terms co-occur within a narrower context such as a paragraph, sentence, or phrase.
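The limitation of the 2D term-to-document model can be sketched with a toy inverted index. The documents, the tokenizer, and all function names below are made up for illustration. The second index, which adds a sentence coordinate to each posting, is an assumption about what a finer-grained index might look like, not the schema of the model described in this article.

```python
from collections import defaultdict

# Toy corpus: each document is a list of sentences (hypothetical data).
docs = {
    "d1": ["PRNP encodes the prion protein.", "Mad Cow Disease affects cattle."],
    "d2": ["PRNP is linked to Mad Cow Disease."],
}


def tokenize(text):
    """Very rough tokenizer: split on whitespace, strip punctuation, lowercase."""
    tokens = (t.strip(".,()").lower() for t in text.split())
    return [t for t in tokens if t]


def build_2d_index(docs):
    """Bag-of-words postings: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in sentences:
            for term in tokenize(sentence):
                index[term].add(doc_id)
    return index


def build_sentence_index(docs):
    """Postings with an extra coordinate: term -> set of (doc id, sentence no)."""
    index = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sent_no, sentence in enumerate(sentences):
            for term in tokenize(sentence):
                index[term].add((doc_id, sent_no))
    return index


def cooccur(index, *terms):
    """Intersect the postings of all terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


doc_level = cooccur(build_2d_index(docs), "prnp", "cow")
sentence_level = cooccur(build_sentence_index(docs), "prnp", "cow")
```

At the document level both d1 and d2 match, even though in d1 the two terms sit in different sentences. Only the sentence-level postings can tell the two cases apart.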

This traditional 2D model is illustrated as a database schema in Fig. 1. The term Index table makes up one entity, the Documents table makes up a second entity, and the Postinglist table provides a 2D

Indexing process

The indexing process illustrated in Fig. 5 includes:

  • (1) Lexical partitioning: documents are parsed into sections (title, abstract, body text), and paragraphs. Paragraphs are parsed into sentences.

  • (2) Tokenization: acronyms and their long-forms are identified during indexing using the Schwartz and Hearst algorithm [15]. A long-short form would include “immuno deficiency enzyme (IDE)”, and a short-long form would include “IDE (immuno deficiency enzyme)”. The algorithm works backwards through the long
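The backward matching idea behind the Schwartz and Hearst algorithm [15] can be sketched as follows. This is a simplified rendering of the published character-matching step only: it assumes the short form and a candidate long-form string have already been extracted from the parenthetical pattern, and it omits the candidate-length and validity checks of the full algorithm. The function name `find_long_form` is illustrative.

```python
from typing import Optional


def find_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` that accounts for every
    alphanumeric character of `short_form`, or None if no match exists.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    # Extend back to the beginning of the word where matching stopped.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


long_a = find_long_form("IDE", "immuno deficiency enzyme")
long_b = find_long_form("PRNP", "disease caused by prion protein")
```

In the second call the leading words "disease caused by" are dropped because the match for the first short-form character is found at the start of "prion".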

Query processing

Structured query generation, also shown in Fig. 5, is illustrated with the following query: “Provide information about the role of the gene PRNP (prion protein) in the disease Mad Cow Disease”.

  • (1) Sentences are extracted, and acronyms and their long-forms are identified: PRNP (PRioN protein).

  • (2) Part-of-speech tagging is performed using our second-order statistical Hidden Markov Model tagger [18]:… role_NN of_II the_DD gene_NN PRNP_NN (_( prion_NN protein_NN )_) in_II the_DD disease_NN Mad_NN Cow_NN
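The tagger output shown in step (2) uses a token_TAG convention. The sketch below parses that output and groups consecutive NN tokens into candidate noun sequences. The grouping rule and the function names are assumptions made purely for illustration; the rule-based query processing applied by the system is not reproduced here.

```python
def parse_tagged(tagged: str):
    """Split 'token_TAG' items into (token, tag) pairs, using the last underscore."""
    pairs = []
    for item in tagged.split():
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


def noun_runs(pairs, noun_tag="NN"):
    """Group maximal runs of consecutive noun-tagged tokens (illustrative rule)."""
    runs, current = [], []
    for token, tag in pairs:
        if tag == noun_tag:
            current.append(token)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


tagged = (
    "role_NN of_II the_DD gene_NN PRNP_NN (_( prion_NN protein_NN )_) "
    "in_II the_DD disease_NN Mad_NN Cow_NN"
)
pairs = parse_tagged(tagged)
runs = noun_runs(pairs)
```

On the fragment above, the runs separate "gene PRNP" from "prion protein" because the parentheses carry their own tags.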

Methods

We evaluated our system on the TREC (Text REtrieval Conference) 2005 Genomics Track ad-hoc retrieval task which uses a corpus of 4,591,008 MEDLINE citations (∼15 GB) and 49 query topics drawn from the information needs of molecular biology researchers [22]. To the best of our knowledge, the TREC-2005 Genomics collection is the largest collection of MEDLINE citations with relevance judgments. Each MEDLINE citation includes an article title, medical subject headings (MeSH), and typically includes

Baseline

The system delivered baseline results of 0.302 mean average precision (MAP) on the genomics collection for document retrieval using the BM25 retrieval function (k1=1.4, k3=7, b=0.75). This establishes a high-performing baseline for document retrieval which exceeds the top result from the 2005 TREC Genomics Track [22] of 0.288 by 4.9%. Establishing our own baseline is important so that we can independently assess the contributions of contextual and semantic search. We attribute much of the
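For readers unfamiliar with the metric, the sketch below shows the standard definition of mean average precision and checks the relative gain quoted above (0.302 versus 0.288). The function names and the toy ranking are illustrative assumptions. Official TREC scores are produced by the track's own evaluation tooling, which this sketch does not replace.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average the per-topic AP over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_gain(system, reference):
    """Relative improvement of `system` over `reference`."""
    return (system - reference) / reference


# Toy topic: relevant documents are retrieved at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})

# Relative gain of the reported baseline over the 2005 top result.
gain_pct = round(relative_gain(0.302, 0.288) * 100, 1)
```

The toy topic gives an average precision of (1/1 + 2/3) / 2, and the gain computed from the two reported MAP values rounds to the 4.9% stated in the text.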

Discussion and related work

Evaluating our system on the 2005 TREC Genomics track MEDLINE collection allows us to directly compare our methods and results with the top-performing systems on the largest collection of biomedical literature with relevance judgments [22]. Common among the top performing systems from York University [26], IBM [27], and our baseline retrieval model is the use of the probabilistic BM25 document retrieval function and techniques for normalizing biological terms. By exceeding the top result, our
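As a point of reference for the BM25 function shared by these systems, the following is a minimal sketch of one common Okapi BM25 formulation with a query-term-frequency component. The default parameters are the values reported for the baseline (k1=1.4, k3=7, b=0.75). The exact BM25 variant and IDF form used in the experiments are not given in this text, so the formula, the function name, and the toy statistics below are assumptions.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.4, k3=7.0, b=0.75):
    """Score one document against a query with a textbook Okapi BM25 form.

    doc_freq maps a term to the number of documents containing it.
    """
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    # Document-length normalization factor.
    k = k1 * ((1 - b) + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term, q_count in qtf.items():
        f = tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if f == 0 or n == 0:
            continue  # the term contributes nothing
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        doc_part = (k1 + 1) * f / (k + f)
        query_part = (k3 + 1) * q_count / (k3 + q_count)
        score += idf * doc_part * query_part
    return score


# Hypothetical collection statistics, for illustration only.
doc_freq = {"prnp": 10, "prion": 25, "protein": 400}
query = ["prnp", "prion"]
doc_a = ["prnp", "prion", "protein", "prnp"]
doc_b = ["growth", "factor", "assay", "protein"]

score_a = bm25_score(query, doc_a, doc_freq, num_docs=1000, avg_doc_len=4.0)
score_b = bm25_score(query, doc_b, doc_freq, num_docs=1000, avg_doc_len=4.0)
```

Here doc_a contains both query terms and receives a positive score, while doc_b contains neither and scores zero.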

Conclusion

We have presented a novel dimensional information retrieval model for effectively combining concept-based semantics, term statistics, and context for improving search precision of genomics literature. Our experimental results are statistically significant, and exceed the state-of-the-art by 15.28% as assessed by the TREC 2005 Genomics track.

The results demonstrate improved search results in the presence of varying levels of concept-based semantic evidence, and the model still performs at or

Summary

We present a dimensional information retrieval model for combining concept-based semantics and term statistics within multiple levels of document context for accurately identifying concise, variable length passages of text to answer a user query.

The system combines a dimensional data model for indexing scientific literature at multiple levels of document structure and context with a rule-based query processing algorithm. The query processing algorithm extracts and resolves biological concepts

Conflict of interest statement

None declared.

Jay Urbain is an Assistant Professor in the Electrical Engineering and Computer Science Department at the Milwaukee School of Engineering. He recently completed his Ph.D. in Computer Science at Illinois Institute of Technology where he was a member of the Information Retrieval Lab. His research interests include information retrieval, text mining, machine learning, and bioinformatics. He is a member of ACM, ASIS&T, AAAI, and IEEE.

References (37)

  • W. Stead, Achievable steps toward building a National Health Information infrastructure in the United States, Journal of the American Medical Informatics Association (2005)
  • S.B. Davidson et al., Challenges in integrating biological data sources, Journal of Computational Biology (1995)
  • W.D. Fenstermacher, Introduction to Bioinformatics, Journal of the American Society for Information Science and Technology (2005)
  • W.J. MacMullen et al., Information problems in molecular biology and bioinformatics, Journal of the American Society for Information Science & Technology (2005)
  • National Institute of Health Roadmap, 〈nihroadmap.nih.gov〉 (accessed...
  • M. Hearst, Multi-paragraph segmentation of expository text, in: Proceedings of the 32nd Meeting of the Association for...
  • A. Ittycheriah, S. Roukos, IBM's statistical question answering system, in: Proceedings of the 10th Text REtrieval...
  • M. Kaszkiel, J. Zobel, Passage retrieval revisited, in: Proceedings of the 20th Annual International ACM-SIGIR...
  • M. Kaszkiel et al., Effective ranking with arbitrary passages, Journal of the American Society of Information Science (2001)
  • J. Lin, The role of information retrieval in answering complex questions, in: Proceedings of the 21st International...
  • S. Tellex, B. Katz, J. Lin, A. Fernandes, G. Marton, Quantitative evaluation of passage retrieval algorithms for...
  • R. White et al., Using top-ranking sentences to facilitate effective information access, Journal of the American Society for Information Science and Technology (2005)
  • R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (1996)
  • J. Gray et al., Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals, Data Mining and Knowledge Discovery (1997)
  • A. Schwartz et al., A simple algorithm for identifying abbreviation definitions in biomedical text, Pacific Symposium on Biocomputing, Kauai (2003)
  • J. Urbain, N. Goharian, A relational genomics search engine, The 2006 International Conference on Bioinformatics and...
  • M.F. Porter, An algorithm for suffix stripping, Program (1980)
  • MedPost, National Center for Biotechnology Information, 〈http://www.ncbi.nlm.nih.gov/staff/lsmith/MedPost.html〉...

Nazli Goharian is a Clinical Associate Professor of Computer Science and a member of the Information Retrieval Laboratory at the Illinois Institute of Technology. Her research interests cover search technology from information retrieval to secure information systems. She is a member of ACM and ASIS&T.

Ophir Frieder is the Royden B. Davis Chair in Interdisciplinary Studies at Georgetown University. He is also the IITRI Chair Professor of Computer Science and the Director of the Information Retrieval Laboratory at the Illinois Institute of Technology, from which he is currently on leave. He frequently consults for industry and government and for key intellectual property litigation. His research interests focus on scalable information retrieval systems spanning search and retrieval and communications issues. His systems are deployed in commercial and governmental production environments worldwide. He is a Fellow of the AAAS, ACM, and IEEE, and the 2007 ASIST Research Award recipient.
