A dimensional retrieval model for integrating semantics and statistical evidence in context for genomics literature search
Introduction
Biological research has been transformed by the explosion of scientific literature documenting the results of research facilitated by high-throughput techniques such as gene microarrays and efficient data acquisition methods. As a result, accurate retrieval of information from genomics text has become a key component in experiments for identifying the interplay between genes, proteins, and other biological and disease processes. In addition, the literature itself has become a source of experiments as researchers turn to it to search for knowledge that drives new hypotheses and research [1], [2], [3], [4]. Indeed, the National Institutes of Health Roadmap [5] cites integration of research data as a fundamental component for fostering future biomedical research.
Information retrieval in this domain can be challenging due to the wide variation of synonymous terms, acronyms, and morphological variants used for identifying the same biological concepts. For example, the term bovine spongiform encephalopathy represents a specific biological concept that can also be represented by several other terms including: BSE, MCD, Mad Cow Disease, JCD, CJD, and Creutzfeldt-Jakob disease. The term Apolipoprotein E can be represented by the acronym ApoE along with the morphological variants Apo-E and Apo E. In addition, acronyms frequently have multiple meanings (polysemy) and require contextual clues for accurate disambiguation. In MEDLINE citations, the acronym IP is used to represent immunophenotype, intraperitoneally, immunoprecipitation, inositol phosphates, ischemic preconditioning, and inverted papilloma. Similarly, GF is used to represent germ free, grip force, GDNF family, griseofulvin, growth factor, gel filtration, glycolytic flux, growth fraction, granuloma faciale, gingival fibroblasts, and glia filament preparation.
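To make the morphological-variant problem concrete, the following minimal sketch folds surface forms that differ only in case, hyphenation, or spacing onto a single key. The function names and the normalization rule are assumptions chosen for illustration and are not the normalization used by the system described here. The rule is deliberately crude: it would not link ApoE to its long form, which requires acronym resolution.

```python
import re
from collections import defaultdict


def normalize_term(term):
    """Lowercase a term and drop whitespace and hyphens (illustrative rule only)."""
    return re.sub(r"[\s\-]+", "", term.lower())


def group_variants(terms):
    """Group surface forms that share the same normalized key."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize_term(term)].append(term)
    return dict(groups)


if __name__ == "__main__":
    # The three ApoE variants fall under one key; BSE stays separate.
    print(group_variants(["ApoE", "Apo-E", "Apo E", "BSE"]))
```

A rule this aggressive trades precision for recall, so it is best read as a starting point rather than a complete treatment of term variation.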
We propose that an effective retrieval model for genomics literature search requires a systematic approach for integrating semantic, contextual, and statistical evidence. Our experimental results validate this approach, demonstrating improved search results in the presence of varying levels of semantic evidence, and higher performance using retrieval functions that combine document as well as sentence and passage level information versus using document, sentence, or passage level information alone.
We first provide a brief overview of genomics information retrieval, followed by the indexing model, the indexing process, our query processing methods, and the resulting retrieval model. This is followed by our experimental methods, results, and a discussion of related work.
Genomics information retrieval
Effective genomics information retrieval requires the ability to identify different biological terms representing the same concept, and context to accurately disambiguate terms and acronyms. For example, context captured within a paragraph or document where an acronym is defined can help disambiguate its meaning. Statistical measures of words co-occurring with ambiguous terms can be used to help disambiguate these concept terms.
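As a rough illustration of how co-occurrence statistics can help disambiguate an acronym, the sketch below scores each candidate sense of IP by summing the weights of context words that appear near it. The sense profiles are hand-made toy values, and the function name and scoring rule are assumptions. In practice such profiles would be estimated from a corpus, and this is not the statistical measure used by the system described here.

```python
from collections import Counter

# Toy co-occurrence profiles for two senses of "IP" (hypothetical values).
SENSE_PROFILES = {
    "immunoprecipitation": Counter({"antibody": 3, "lysate": 2, "western": 2, "blot": 2}),
    "intraperitoneally": Counter({"injected": 3, "mice": 2, "dose": 2, "mg": 1}),
}


def disambiguate(context_tokens, profiles):
    """Return the sense whose profile best overlaps the context, or None."""
    scores = {
        sense: sum(profile[token] for token in context_tokens)
        for sense, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    # With no overlapping evidence, decline to choose a sense.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    context = "mice were injected ip with a single dose".split()
    print(disambiguate(context, SENSE_PROFILES))
```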
Context for identifying relevant search results can be captured at
Dimensional indexing model
Most search indexes are based on a tabular 2-dimensional indexing model relating terms to the documents they appear in. In this model, commonly referred to as the “bag of words” model, we are unable to determine if terms co-occur within a narrower context such as a paragraph, sentence, or phrase.
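The limitation can be seen in a small sketch. It builds a term-to-document index and, for contrast, postings that also record a sentence number. The sentence splitter, the tokenizer, the function names, and the two toy documents are all assumptions made for illustration. They do not reproduce the schema described in this paper.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation (assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_2d_index(docs):
    """term -> {doc_id: term frequency}; position within the document is lost."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index


def build_sentence_index(docs):
    """term -> {(doc_id, sentence_id)}; keeps one extra level of context."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for sent_id, sentence in enumerate(split_sentences(text)):
            for token in tokenize(sentence):
                index[token].add((doc_id, sent_id))
    return index


if __name__ == "__main__":
    docs = {
        "d1": "PRNP encodes the prion protein. BSE affects cattle.",
        "d2": "PRNP mutations are linked to BSE.",
    }
    idx2d = build_2d_index(docs)
    idx_sent = build_sentence_index(docs)
    # Document-level co-occurrence: both documents match.
    print(set(idx2d["prnp"]) & set(idx2d["bse"]))
    # Sentence-level co-occurrence: only d2, sentence 0.
    print(idx_sent["prnp"] & idx_sent["bse"])
```

The term-to-document index reports both documents as matches for PRNP and BSE. Only the sentence-level postings show that the two terms share a sentence in d2 alone.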
This traditional 2D model is illustrated as a database schema in Fig. 1. The term Index table makes up one entity, the Documents table makes up a second entity, and the Postinglist table provides a 2D
Indexing process
The indexing process illustrated in Fig. 5 includes:
- (1) Lexical partitioning: documents are parsed into sections (title, abstract, body text), and paragraphs. Paragraphs are parsed into sentences.
- (2) Tokenization: acronyms and their long-forms are identified during indexing using the Schwartz and Hearst algorithm [15]. A long-short form would include “immuno deficiency enzyme (IDE)”, and a short-long form would include “IDE (immuno deficiency enzyme)”. The algorithm works backwards through the long
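The right-to-left matching at the core of the published Schwartz and Hearst heuristic can be sketched as follows. This is a simplified illustration, not the indexing code of the system described here. It handles only the long-short pattern (long form followed by a parenthesized short form), omits the published validity filters on candidate short forms, and uses function names chosen for this example.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` containing every alphanumeric
    character of `short_form` in order, with the first character of the short
    form starting a word. Returns None when no match exists.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; for the first short-form
        # character, also require that the match begins a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_long_short_pairs(sentence):
    """Find 'long form (SHORT)' definitions in one sentence."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        words_before = sentence[: match.start()].split()
        # Search window suggested by Schwartz and Hearst: min(|A| + 5, 2|A|) words.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words_before[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


if __name__ == "__main__":
    print(extract_long_short_pairs("We studied immuno deficiency enzyme (IDE) activity."))
```

Handling the short-long pattern would require a second pass that treats the text inside the parentheses as the long-form candidate.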
Query processing
Structured query generation, also shown in Fig. 5, is illustrated with the following query: “Provide information about the role of the gene PRNP (prion protein) in the disease Mad Cow Disease”.
- (1) Sentences are extracted, and acronyms and their long-forms are identified: PRNP (PRioN protein).
- (2) Part-of-speech tagging is performed using our second-order statistical Hidden Markov Model tagger [18]: … role_NN of_II the_DD gene_NN PRNP_NN (_( prion_NN protein_NN )_) in_II the_DD disease_NN Mad_NN Cow_NN
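The tagged output above uses a token_TAG format that is straightforward to consume downstream. The sketch below parses that format and groups consecutive NN tokens into candidate concept phrases. Grouping by adjacent noun tags is an assumption made for illustration. It is not the rule-based query processing algorithm of the system described here, and the function names are hypothetical.

```python
def parse_tagged(tagged):
    """Split 'token_TAG' items into (token, tag) pairs, using the last underscore."""
    pairs = []
    for item in tagged.split():
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


def noun_runs(pairs, noun_tag="NN"):
    """Join maximal runs of consecutive noun-tagged tokens into phrases."""
    runs, current = [], []
    for token, tag in pairs:
        if tag == noun_tag:
            current.append(token)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


if __name__ == "__main__":
    tagged = (
        "role_NN of_II the_DD gene_NN PRNP_NN (_( prion_NN protein_NN )_) "
        "in_II the_DD disease_NN Mad_NN Cow_NN"
    )
    print(noun_runs(parse_tagged(tagged)))
```

On the excerpt shown, the parentheses act as phrase boundaries, so the acronym and its long form end up in separate runs.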
Methods
We evaluated our system on the TREC (Text REtrieval Conference) 2005 Genomics Track ad-hoc retrieval task which uses a corpus of 4,591,008 MEDLINE citations (∼15 GB) and 49 query topics drawn from the information needs of molecular biology researchers [22]. To the best of our knowledge, the TREC-2005 Genomics collection is the largest collection of MEDLINE citations with relevance judgments. Each MEDLINE citation includes an article title, medical subject headings (MeSH), and typically includes
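Ad-hoc runs on a collection with relevance judgments are commonly scored with mean average precision (MAP), the measure reported for the baseline. The sketch below shows the standard calculation on toy rankings. The function names, document identifiers, and topics are hypothetical and are not taken from the TREC data.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, judgments):
    """Mean of per-topic average precision over all judged topics."""
    topics = list(judgments)
    if not topics:
        return 0.0
    return sum(
        average_precision(runs.get(topic, []), judgments[topic]) for topic in topics
    ) / len(topics)


if __name__ == "__main__":
    runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d9", "d7"]}
    judgments = {"t1": {"d1", "d3"}, "t2": {"d7"}}
    print(mean_average_precision(runs, judgments))
```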
Baseline
The system delivered baseline results of 0.302 mean average precision (MAP) on the genomics collection for document retrieval using the BM25 retrieval function (k1=1.4, k3=7, b=0.75). This establishes a high-performing baseline for document retrieval which exceeds the top result from the 2005 TREC Genomics Track [22] of 0.288 by 4.9%. Establishing our own baseline is important so that we can independently assess the contributions of contextual and semantic search. We attribute much of the
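The text names the BM25 function and its parameters but does not reproduce the formula. As a minimal sketch, the code below implements the standard Okapi BM25 form with a query-term-frequency component, which is where k3 applies, using the reported settings. The idf variant, the function names, and the toy corpus are assumptions, and this is not the system's actual implementation.

```python
import math
from collections import Counter

K1, K3, B = 1.4, 7.0, 0.75  # parameter values reported for the baseline


def document_frequencies(corpus):
    """Count the number of documents in which each term occurs."""
    df = Counter()
    for doc_terms in corpus:
        df.update(set(doc_terms))
    return df


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len):
    """Okapi BM25 score of one document for one query."""
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    # Document-length normalization factor.
    k = K1 * ((1 - B) + B * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term, q_count in qtf.items():
        if tf[term] == 0:
            continue
        n = doc_freq[term]
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        doc_part = (K1 + 1) * tf[term] / (k + tf[term])
        query_part = (K3 + 1) * q_count / (K3 + q_count)
        score += idf * doc_part * query_part
    return score


if __name__ == "__main__":
    corpus = [
        ["prnp", "prion", "protein"],
        ["growth", "factor"],
        ["gel", "filtration"],
        ["grip", "force"],
        ["germ", "free"],
    ]
    df = document_frequencies(corpus)
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    for doc in corpus:
        print(doc, bm25_score(["prnp", "protein"], doc, df, len(corpus), avg_len))
```

With this idf variant, a term that occurs in more than half of the documents receives a negative weight, which is one reason some implementations use a smoothed alternative.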
Discussion and related work
Evaluating our system on the 2005 TREC Genomics track MEDLINE collection allows us to directly compare our methods and results with the top-performing systems on the largest collection of biomedical literature with relevance judgments [22]. Common among the top performing systems from York University [26], IBM [27], and our baseline retrieval model is the use of the probabilistic BM25 document retrieval function and techniques for normalizing biological terms. By exceeding the top result, our
Conclusion
We have presented a novel dimensional information retrieval model for effectively combining concept-based semantics, term statistics, and context for improving search precision of genomics literature. Our experimental results are statistically significant, and exceed the state-of-the-art by 15.28% as assessed by the TREC 2005 Genomics track.
The results demonstrate improved search results in the presence of varying levels of concept-based semantic evidence, and the model still performs at or
Summary
We present a dimensional information retrieval model for combining concept-based semantics and term statistics within multiple levels of document context for accurately identifying concise, variable length passages of text to answer a user query.
The system combines a dimensional data model for indexing scientific literature at multiple levels of document structure and context with a rule-based query processing algorithm. The query processing algorithm extracts and resolves biological concepts
Conflict of interest statement
None declared.
Jay Urbain is an Assistant Professor in the Electrical Engineering and Computer Science Department at the Milwaukee School of Engineering. He recently completed his Ph.D. in Computer Science at Illinois Institute of Technology where he was a member of the Information Retrieval Lab. His research interests include information retrieval, text mining, machine learning, and bioinformatics. He is a member of ACM, ASIS&T, AAAI, and IEEE.
References (37)
Achievable steps toward building a National Health Information infrastructure in the United States
Journal of the American Medical Informatics Association
(2005)
- et al.
Challenges in integrating biological data sources
Journal of Computational Biology
(1995)
Introduction to Bioinformatics
Journal of the American Society for Information Science and Technology
(2005)
- et al.
Information problems in molecular biology and bioinformatics
Journal of the American Society for Information Science & Technology
(2005)
- National Institutes of Health Roadmap, 〈nihroadmap.nih.gov〉 (accessed...
- M. Hearst, Multi-paragraph segmentation of expository text, in: Proceedings of the 32nd Meeting of the Association for...
- A. Ittycheriah, S. Roukos, IBM's statistical question answering system, in: Proceedings of the 10th Text REtrieval...
- M. Kaszkiel, J. Zobel, Passage retrieval revisited, in: Proceedings of the 20th Annual International ACM-SIGIR...
- et al.
Effective ranking with arbitrary passages
Journal of the American Society of Information Science
(2001)
- J. Lin, The role of information retrieval in answering complex questions, in: Proceedings of the 21st International...
Using top-ranking sentences to facilitate effective information access
Journal of the American Society for Information Science and Technology
The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses
Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals
Data Mining and Knowledge Discovery
A simple algorithm for identifying abbreviation definitions in biomedical text
Pacific Symposium on Biocomputing, Kauai
An algorithm for suffix stripping
Program
Nazli Goharian is Clinical Associate Professor of Computer Science and a member of the Information Retrieval Laboratory at the Illinois Institute of Technology. Her research interests cover search technology from information retrieval to secure information systems. She is a member of ACM & ASIS&T.
Ophir Frieder is the Royden B. Davis Chair in Interdisciplinary Studies at Georgetown University. He is also the IITRI Chair Professor of Computer Science and the Director of the Information Retrieval Laboratory at the Illinois Institute of Technology, from which he is currently on leave. He frequently consults for industry and government and for key intellectual property litigation. His research interests focus on scalable information retrieval systems spanning search and retrieval and communications issues. His systems are deployed in actual commercial and governmental production environments worldwide. He is a Fellow of the AAAS, ACM, and IEEE, and the 2007 ASIST Research Award recipient.