Positional language models for information retrieval

Authors:

ChengXiang Zhai

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 299 - 306

https://doi.org/10.1145/1571941.1571994

Published: 19 July 2009

Abstract

Although many variants of language models have been proposed for information retrieval, two related retrieval heuristics remain "external" to the language modeling approach: (1) the proximity heuristic, which rewards a document where the matched query terms occur close to each other; and (2) passage retrieval, which scores a document mainly based on its best-matching passage. Existing studies have only attempted to use a standard language model as a "black box" to implement these heuristics, which makes it hard to optimize the combination parameters.

In this paper, we propose a novel positional language model (PLM) that implements both heuristics in a unified language model. The key idea is to define a language model for each position of a document, and to score a document based on the scores of its PLMs. The PLM is estimated from propagated counts of words within a document through a proximity-based density function, which both captures proximity heuristics and achieves an effect of "soft" passage retrieval. We propose and study several representative density functions and several different PLM-based document ranking strategies. Experimental results on standard TREC test collections show that the PLM is effective for passage retrieval and performs better than a state-of-the-art proximity-based retrieval model.
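The estimation step described above can be illustrated with a minimal sketch. It assumes a Gaussian kernel as the proximity-based density function and scores a document by its best-scoring position. The function names, the `sigma` and `eps` parameters, and the simple log-likelihood scoring are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import defaultdict


def gaussian_kernel(i, j, sigma):
    """Weight propagated from position j to position i (assumed Gaussian)."""
    return math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))


def positional_lm(tokens, i, sigma=2.0):
    """Estimate a language model at position i from propagated word counts.

    Every occurrence of a word at position j contributes a fractional
    count to position i that decays with the distance |i - j|.
    """
    propagated = defaultdict(float)
    for j, word in enumerate(tokens):
        propagated[word] += gaussian_kernel(i, j, sigma)
    total = sum(propagated.values())
    return {word: count / total for word, count in propagated.items()}


def best_position_score(tokens, query, sigma=2.0, eps=1e-6):
    """Score a document by its best-scoring position (one possible strategy).

    The per-position score is a query log-likelihood with a small additive
    constant eps standing in for proper smoothing (an assumption made to
    keep the sketch short).
    """
    best = -math.inf
    for i in range(len(tokens)):
        plm = positional_lm(tokens, i, sigma)
        score = sum(math.log(plm.get(q, 0.0) + eps) for q in query)
        best = max(best, score)
    return best


# Two documents with identical word counts but different term proximity.
doc_near = "x y a b c d e f".split()
doc_far = "x a b c d e f y".split()
query = ["x", "y"]
near_score = best_position_score(doc_near, query)
far_score = best_position_score(doc_far, query)
```

In this toy example both documents contain the same words, so a bag-of-words model could not tell them apart; the positional estimate gives `doc_near` the higher score because the two query terms reinforce each other at nearby positions.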

Index Terms

Positional language models for information retrieval
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN: 9781605584836

DOI: 10.1145/1571941

General Chairs:
James Allan (University of Massachusetts Amherst, USA)
Javed Aslam (Northeastern University, USA)

Program Chairs:
Mark Sanderson (University of Sheffield, UK)
ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA)
Justin Zobel (University of Melbourne, Australia)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

Total Citations: 135
Total Downloads: 1,689
Downloads (Last 12 months): 21
Downloads (Last 6 weeks): 4

Reflects downloads up to 28 Feb 2025

Cited By

Ali A, Tufail A, De Silva L, Abas P (2024). Innovating Patent Retrieval: A Comprehensive Review of Techniques, Trends, and Challenges in Prior Art Searches. Applied System Innovation 7:5 (91). Online publication date: 26-Sep-2024.
https://doi.org/10.3390/asi7050091

Repp S, Haffner E, Gurrin C, Kongkachandra R, Schoeffmann K, Dang-Nguyen D, Rossetto L, Satoh S, Zhou L (2024). Dynamic Segmentation for Efficient Retrieval of Podcasts: The Repping Algorithm. Proceedings of the 2024 International Conference on Multimedia Retrieval (29-36). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3652583.3658047

Pan M, Zhou S, Li T, Liu Y, Pei Q, Huang A, Huang J (2024). Utilizing passage‐level relevance and kernel pooling for enhancing BERT‐based document reranking. Computational Intelligence 40:3. Online publication date: 7-Jun-2024.
https://doi.org/10.1111/coin.12656

Pimpalkhute V, Heyer J, Yin X, Gupta S (2024). SoftQE: Learned Representations of Queries Expanded by LLMs. Advances in Information Retrieval (68-77). Online publication date: 24-Mar-2024.
https://dl.acm.org/doi/10.1007/978-3-031-56066-8_8

Vanetik N, Podkaminer E, Litvak M (2023). Summarizing Financial Reports with Positional Language Model. 2023 IEEE International Conference on Big Data (BigData) (2877-2883). Online publication date: 15-Dec-2023.
https://doi.org/10.1109/BigData59044.2023.10386704

Pal D, Ganguly D, Hasibi F, Fang Y, Aizawa A (2021). Effective Query Formulation in Conversation Contextualization: A Query Specificity-based Approach. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval (177-183). Online publication date: 11-Jul-2021.
https://dl.acm.org/doi/10.1145/3471158.3472237

Jain A, Kothyari M, Kumar V, Jyothi P, Ramakrishnan G, Chakrabarti S, Diaz F, Shah C, Suel T, Castells P, Jones R, Sakai T (2021). Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2491-2498). Online publication date: 11-Jul-2021.
https://dl.acm.org/doi/10.1145/3404835.3463259

Dang E, Luk R, Allan J (2021). A Principled Approach Using Fuzzy Set Theory for Passage-Based Document Retrieval. IEEE Transactions on Fuzzy Systems 29:7 (1967-1977). Online publication date: Jul-2021.
https://doi.org/10.1109/TFUZZ.2020.2990110

Nga C, Li C, Li Y, Wang J (2021). A Survey of Vietnamese Automatic Speech Recognition. 2021 9th International Conference on Orange Technology (ICOT) (1-4). Online publication date: 16-Dec-2021.
https://doi.org/10.1109/ICOT54518.2021.9680652

Sepúlveda-Torres R, Vicente M, Saquete E, Lloret E, Palomar M (2021). HeadlineStanceChecker: Exploiting summarization to detect headline disinformation. Journal of Web Semantics (100660). Online publication date: Sep-2021.
https://doi.org/10.1016/j.websem.2021.100660
