Optimal Mixture Models in IR

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2291)

Abstract

We explore the use of Optimal Mixture Models to represent topics. We analyze two broad classes of mixture models: set-based and weighted. We provide an original proof that estimation of set-based models is NP-hard, and therefore not feasible. We argue that weighted models are superior to set-based models, and that their solution can be estimated by a simple gradient descent technique. We demonstrate that Optimal Mixture Models can be successfully applied to the task of document retrieval. Our experiments show that weighted mixtures outperform a simple language modeling baseline. We also observe that weighted mixtures are more robust than other approaches to estimating topical models.
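
To make the idea of a weighted mixture estimated by gradient descent concrete, the following is a minimal sketch and not the procedure from the paper. It assumes that a topic is given as a target word distribution, that each document is a unigram language model over the same vocabulary, and that the weights are chosen to minimize cross-entropy between the target and the mixture. The softmax parameterization, the function names, and the toy data are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Map unconstrained scores to positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture(weights, components):
    """Weighted average of the component word distributions."""
    vocab_size = len(components[0])
    return [
        sum(w * comp[v] for w, comp in zip(weights, components))
        for v in range(vocab_size)
    ]


def cross_entropy(target, model, eps=1e-12):
    """Cross-entropy of the model with respect to the target distribution."""
    return -sum(t * math.log(m + eps) for t, m in zip(target, model) if t > 0)


def fit_mixture_weights(target, components, steps=2000, lr=0.5, eps=1e-12):
    """Plain gradient descent on softmax scores (an assumed parameterization)."""
    scores = [0.0] * len(components)  # start from uniform weights
    for _ in range(steps):
        weights = softmax(scores)
        mix = mixture(weights, components)
        # Gradient of the cross-entropy with respect to each weight.
        grad_w = [
            -sum(t * comp[v] / (mix[v] + eps) for v, t in enumerate(target) if t > 0)
            for comp in components
        ]
        # Chain rule through the softmax.
        avg = sum(w * g for w, g in zip(weights, grad_w))
        scores = [s - lr * w * (g - avg) for s, w, g in zip(scores, weights, grad_w)]
    return softmax(scores)


# Toy data (hypothetical): two document models over a four-word vocabulary.
docs = [
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
# Target built as 0.75 * docs[0] + 0.25 * docs[1].
topic = [0.525, 0.175, 0.125, 0.175]

weights = fit_mixture_weights(topic, docs)
print([round(w, 3) for w in weights])
```

Because the toy target is constructed as an exact mixture of the two document models, the fitted weights should land close to the 0.75 and 0.25 used to build it. A set-based model would instead have to choose which documents to include, which is the combinatorial problem the abstract describes as NP-hard.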



Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lavrenko, V. (2002). Optimal Mixture Models in IR. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_14

  • DOI: https://doi.org/10.1007/3-540-45886-7_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43343-9

  • Online ISBN: 978-3-540-45886-9

  • eBook Packages: Springer Book Archive
