Abstract
We explore the use of Optimal Mixture Models to represent topics. We analyze two broad classes of mixture models: set-based and weighted. We provide an original proof that estimation of set-based models is NP-hard and therefore not feasible. We argue that weighted models are superior to set-based models and that their solution can be estimated by a simple gradient descent technique. We demonstrate that Optimal Mixture Models can be successfully applied to the task of document retrieval. Our experiments show that weighted mixtures outperform a simple language modeling baseline. We also observe that weighted mixtures are more robust than other approaches to estimating topical models.
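To make the weighted case concrete, the following is a minimal sketch of estimating mixture weights by gradient descent. The abstract does not state the objective or the parameterization, so both are assumptions here: the sketch minimizes the cross-entropy between a target topic distribution and a weighted mixture of component language models, and it keeps the weights on the probability simplex with a softmax. All names (`fit_weights`, `components`, `target`) and the toy numbers are hypothetical, and the components are assumed to be smoothed so that every word has nonzero probability.

```python
import math


def softmax(theta):
    """Map unconstrained parameters to weights that are positive and sum to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def mixture(weights, components):
    """Weighted mixture of component distributions over a shared vocabulary."""
    vocab_size = len(components[0])
    return [
        sum(w * comp[v] for w, comp in zip(weights, components))
        for v in range(vocab_size)
    ]


def cross_entropy(target, model):
    """Cross-entropy of the model distribution relative to the target."""
    return -sum(t * math.log(m) for t, m in zip(target, model) if t > 0)


def fit_weights(target, components, steps=5000, lr=0.5):
    """Plain gradient descent on softmax parameters (assumed objective:
    cross-entropy between the target and the weighted mixture)."""
    theta = [0.0] * len(components)  # start from uniform weights
    for _ in range(steps):
        lam = softmax(theta)
        mix = mixture(lam, components)
        # Gradient of the objective with respect to each mixture weight.
        g = [
            -sum(t * comp[v] / mix[v] for v, t in enumerate(target) if t > 0)
            for comp in components
        ]
        # Chain rule through the softmax parameterization.
        avg = sum(l * gi for l, gi in zip(lam, g))
        theta = [th - lr * l * (gi - avg) for th, l, gi in zip(theta, lam, g)]
    return softmax(theta)


# Toy data (hypothetical): three smoothed component models over three words.
components = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
]
# Build the target from known weights so the recovered weights can be checked.
target = mixture([0.5, 0.3, 0.2], components)

weights = fit_weights(target, components)
print([round(w, 3) for w in weights])
```

The softmax is only one way to enforce the simplex constraint. Projected gradient steps or multiplicative updates would be alternatives, and the paper itself may use a different formulation.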
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lavrenko, V. (2002). Optimal Mixture Models in IR. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_14
DOI: https://doi.org/10.1007/3-540-45886-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43343-9
Online ISBN: 978-3-540-45886-9
eBook Packages: Springer Book Archive