A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

Published: 02 August 2017

Abstract

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator to correct the inaccuracy caused by data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
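
The ranking idea described in the abstract can be illustrated with a minimal sketch. It assumes unigram document models and tokenized input, and it uses Jelinek-Mercer interpolation and Dirichlet prior smoothing as two standard choices; the abstract itself does not name the methods compared. All function names, the toy documents, and the default values of `lam` and `mu` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum likelihood unigram model estimated from the whole collection."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(word, doc_counts, doc_len, p_coll, lam=0.5):
    """Linear interpolation of the document and collection estimates."""
    p_ml = doc_counts[word] / doc_len if doc_len else 0.0
    return (1 - lam) * p_ml + lam * p_coll.get(word, 0.0)


def dirichlet(word, doc_counts, doc_len, p_coll, mu=2000.0):
    """Dirichlet prior smoothing: collection model acts as mu pseudo-counts."""
    return (doc_counts[word] + mu * p_coll.get(word, 0.0)) / (doc_len + mu)


def query_log_likelihood(query, doc, p_coll, smooth, **params):
    """Log-probability of the query under the smoothed document model."""
    counts = Counter(doc)
    score = 0.0
    for w in query:
        p = smooth(w, counts, len(doc), p_coll, **params)
        if p == 0.0:
            # An unseen word with no smoothing mass zeroes the whole query.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, smooth=dirichlet, **params):
    """Return document indices ordered by decreasing query likelihood."""
    p_coll = collection_model(docs)
    scored = [
        (query_log_likelihood(query, d, p_coll, smooth, **params), i)
        for i, d in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy collection (illustrative only).
docs = [
    ["language", "model", "smoothing"],
    ["speech", "recognition", "model"],
    ["retrieval", "query", "language", "model"],
]
query = ["language", "smoothing"]
print(rank(query, docs, smooth=jelinek_mercer, lam=0.5))
print(rank(query, docs, smooth=dirichlet, mu=2000.0))
```

Setting `lam=0.0` in `jelinek_mercer` reduces the estimate to the unsmoothed maximum likelihood estimator, under which any document missing a single query word receives a likelihood of zero. That is the data sparseness problem smoothing is meant to correct.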

        Published In

        ACM SIGIR Forum  Volume 51, Issue 2
        SIGIR Test-of-Time Awardees 1978-2001
        July 2017
        276 pages
        ISSN:0163-5840
        DOI:10.1145/3130348
        Editors: Donna Harman, Diane Kelly
        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • Column
