Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR

Zhou, Xiaohua; Zhang, Xiaodan; Hu, Xiaohua

doi:10.1007/11735106_39

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

Genomic IR, characterized by highly specific information needs, severe synonymy and polysemy problems, long term names, and a rapidly growing literature, poses a challenge to the IR community. In this paper, we focus on addressing the synonymy and polysemy issue within the language modeling framework. Unlike translation models and traditional query expansion techniques, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedical domain; a set of synonymous terms shares the same concept ID. Therefore, the new approach makes document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows that incorporating concept-based indexing into a basic language model yields significant improvements: MAP (mean average precision) rises from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of the official runs that participated in the TREC 2004 Genomics Track and is comparable to that of the best run (40.75%). Most official runs, including the best run, make extensive use of query expansion and pseudo-relevance feedback techniques, whereas our approach does nothing beyond incorporating concept-based indexing. This supports the view that semantic smoothing, i.e., the incorporation of synonym and sense information into language models, is a more standard approach to achieving the effects that traditional query expansion and pseudo-relevance feedback techniques target.
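
The idea of indexing by concept rather than by surface term can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon and its concept IDs are made up for illustration (they are not real UMLS identifiers), tokenization is plain whitespace splitting, and the query-likelihood model uses Jelinek-Mercer smoothing, which is an assumption because the abstract does not name a smoothing method. The function names `to_words`, `to_concepts`, `query_log_likelihood`, and `rank` are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: surface term -> concept ID.
# The IDs are invented for this sketch and are NOT real UMLS identifiers.
TOY_LEXICON = {
    "tumor": "C_NEOPLASM",
    "tumour": "C_NEOPLASM",
    "neoplasm": "C_NEOPLASM",
    "p53": "C_TP53",
    "tp53": "C_TP53",
    "mouse": "C_MOUSE",
    "mice": "C_MOUSE",
}


def to_words(text):
    """Word-based indexing units: lowercased whitespace tokens."""
    return text.lower().split()


def to_concepts(text, lexicon=TOY_LEXICON):
    """Concept-based indexing units: tokens mapped to concept IDs.

    Tokens without an entry in the lexicon are dropped (a simplification).
    """
    return [lexicon[t] for t in text.lower().split() if t in lexicon]


def query_log_likelihood(query_units, doc_units, coll_counts, coll_len, lam=0.5):
    """log p(q | d) under a unigram model with Jelinek-Mercer smoothing.

    The smoothing method and lam=0.5 are assumptions of this sketch.
    Query units that never occur in the collection are skipped.
    """
    doc_counts = Counter(doc_units)
    doc_len = len(doc_units)
    score = 0.0
    for unit in query_units:
        p_doc = doc_counts[unit] / doc_len if doc_len else 0.0
        p_coll = coll_counts[unit] / coll_len if coll_len else 0.0
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p > 0.0:
            score += math.log(p)
    return score


def rank(query, docs, index_fn, lam=0.5):
    """Rank documents (dict of id -> text) by query log-likelihood, best first."""
    indexed = {doc_id: index_fn(text) for doc_id, text in docs.items()}
    coll_counts = Counter(u for units in indexed.values() for u in units)
    coll_len = sum(coll_counts.values())
    query_units = index_fn(query)
    scored = [
        (doc_id, query_log_likelihood(query_units, units, coll_counts, coll_len, lam))
        for doc_id, units in indexed.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


DOCS = {
    "d1": "tumour growth in mice",
    "d2": "p53 pathway review",
    "d3": "weather report",
}

word_ranking = rank("tumor mouse", DOCS, to_words)
concept_ranking = rank("tumor mouse", DOCS, to_concepts)
```

On this toy collection the query "tumor mouse" shares no surface word with "tumour growth in mice", so word-based indexing cannot separate the documents, whereas concept-based indexing maps the synonyms to the same IDs and places `d1` first. The ranking function itself is unchanged between the two runs; only the indexing units differ, which reflects the simplicity the abstract claims for the approach.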

This research work is supported in part by the NSF Career grant (NSF IIS 0448023), NSF CCF 0514679, and a research grant from the PA Dept of Health.



Author information

Authors and Affiliations

College of Information Science & Technology, Drexel University, 3141 Chestnut Street, Philadelphia, PA, 19104, USA
Xiaohua Zhou, Xiaodan Zhang & Xiaohua Hu


Editor information

Editors and Affiliations

Queen Mary, University of London, London, UK
Mounia Lalmas
Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK
Andy MacFarlane
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Queen Mary University of London, UK
Anastasios Tombros
CWI, Amsterdam, The Netherlands
Theodora Tsikrika
Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK
Alexei Yavlinsky

About this paper

Cite this paper

Zhou, X., Zhang, X., Hu, X. (2006). Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_39

DOI: https://doi.org/10.1007/11735106_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)
