
Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

Genomic IR poses a challenge to the IR community: it is characterized by highly specific information needs, severe synonymy and polysemy problems, long term names, and a rapidly growing literature. In this paper, we focus on addressing the synonymy and polysemy issue within the language modeling framework. Unlike the translation model and traditional query expansion techniques, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedical domain, and a set of synonymous terms shares the same concept ID. The new approach therefore makes document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows that incorporating concept-based indexing into a basic language model yields significant improvements: the MAP (mean average precision) rises significantly from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of the official runs that participated in the TREC 2004 Genomics Track and is comparable to the performance of the best run (40.75%). Most official runs, including the best run, make extensive use of various query expansion and pseudo-relevance feedback techniques, whereas our approach does nothing beyond incorporating concept-based indexing. This supports the view that semantic smoothing, i.e. the incorporation of synonym and sense information into language models, is a more standard approach to achieving the effects that traditional query expansion and pseudo-relevance feedback techniques target.
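To make the idea concrete, here is a minimal sketch of concept-based indexing inside a query-likelihood language model. It is an illustration under stated assumptions, not the authors' implementation: the term-to-concept table and its IDs are hypothetical placeholders rather than real UMLS identifiers, tokens are single words (real term names are often multi-word and need a proper extraction step), and Dirichlet-prior smoothing is assumed because the abstract does not name a smoothing method.

```python
import math
from collections import Counter

# Hypothetical term-to-concept table. The IDs are placeholders,
# not real UMLS concept identifiers.
TERM_TO_CONCEPT = {
    "tumor": "C_NEOPLASM",
    "tumour": "C_NEOPLASM",
    "neoplasm": "C_NEOPLASM",
    "p53": "C_TP53",
    "tp53": "C_TP53",
}


def to_concepts(text):
    """Map each whitespace token to a concept ID; unmapped tokens stay as-is."""
    return [TERM_TO_CONCEPT.get(tok, tok) for tok in text.lower().split()]


def build_index(docs):
    """Return per-document concept counts and collection-wide concept counts."""
    doc_counts = {doc_id: Counter(to_concepts(text)) for doc_id, text in docs.items()}
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    return doc_counts, collection


def score(query, counts, collection, mu=10.0):
    """Log query likelihood with Dirichlet-prior smoothing (an assumption)."""
    doc_len = sum(counts.values())
    coll_len = sum(collection.values())
    total = 0.0
    for concept in to_concepts(query):
        p_coll = collection[concept] / coll_len
        if p_coll == 0.0:
            continue  # concept never seen in the collection: skip it
        total += math.log((counts[concept] + mu * p_coll) / (doc_len + mu))
    return total


def rank(query, docs, mu=10.0):
    """Rank document IDs by descending query likelihood over concepts."""
    doc_counts, collection = build_index(docs)
    scored = [(score(query, c, collection, mu), d) for d, c in doc_counts.items()]
    return [d for _, d in sorted(scored, reverse=True)]


docs = {
    "d1": "tp53 mutation in neoplasm",
    "d2": "protein folding kinetics study",
}
ranking = rank("p53 tumour", docs)
```

In this toy collection the query "p53 tumour" shares no surface word with either document, yet after mapping to concept IDs it matches d1 on both concepts, so d1 ranks first without any query expansion step.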

This research work is supported in part by the NSF Career grant (NSF IIS 0448023), NSF CCF 0514679, and a research grant from the PA Dept of Health.




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, X., Zhang, X., Hu, X. (2006). Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_39


  • DOI: https://doi.org/10.1007/11735106_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33347-0

  • Online ISBN: 978-3-540-33348-7

  • eBook Packages: Computer Science, Computer Science (R0)
