
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

Authors:

Chengxiang Zhai,

John Lafferty

ACM SIGIR Forum, Volume 51, Issue 2

Pages 268 - 276

https://doi.org/10.1145/3130348.3130377

Published: 02 August 2017

Abstract

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
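To make the ranking idea concrete, here is a minimal sketch of query-likelihood scoring with a smoothed document model. The abstract gives no formulas and does not name the smoothing methods, so the two shown here (Jelinek-Mercer interpolation and Dirichlet-prior smoothing) are assumptions chosen for illustration. The function name `query_log_likelihood`, the parameter values `lam=0.5` and `mu=2000.0`, the toy documents, and the choice to skip query words absent from the whole collection are likewise hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, method="dirichlet",
                         lam=0.5, mu=2000.0):
    """Return log p(query | smoothed unigram model of doc).

    query, doc, collection are lists of tokens; collection is the
    concatenation of all documents and serves as the background model.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)

    score = 0.0
    for word in query:
        p_coll = coll_counts[word] / coll_len
        if p_coll == 0.0:
            # Assumption: a word unseen in the whole collection is skipped,
            # since the background model cannot smooth it.
            continue
        if method == "jm":
            # Jelinek-Mercer: linear interpolation of the maximum
            # likelihood estimate with the collection model.
            p_ml = doc_counts[word] / doc_len if doc_len else 0.0
            p = (1.0 - lam) * p_ml + lam * p_coll
        elif method == "dirichlet":
            # Dirichlet prior: add mu pseudo-counts distributed
            # according to the collection model.
            p = (doc_counts[word] + mu * p_coll) / (doc_len + mu)
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        score += math.log(p)
    return score


# Toy data (hypothetical), just to show the call pattern.
docs = {
    "d1": "language model smoothing for retrieval".split(),
    "d2": "speech recognition with hidden markov models".split(),
}
collection = [token for tokens in docs.values() for token in tokens]
query = "smoothing retrieval".split()

# Rank documents by descending query log-likelihood.
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], collection),
    reverse=True,
)
```

Without smoothing, `d2` would assign the query probability zero because it contains neither query word. With either method it receives a small but nonzero probability from the collection model, so its log-likelihood stays finite and it can still be ranked.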



Index Terms

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking

Index terms have been assigned to the content through auto-classification.


Published In


ACM SIGIR Forum Volume 51, Issue 2

SIGIR Test-of-Time Awardees 1978-2001

July 2017

276 pages

ISSN:0163-5840

DOI:10.1145/3130348

Editors:
Donna Harman, National Institute of Standards and Technology, Gaithersburg MD, USA
Diane Kelly, University of Tennessee, Knoxville TN, USA


Copyright © 2017 Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 August 2017

Published in SIGIR Volume 51, Issue 2


Qualifiers

Column


