Information Retrieval as Statistical Translation

Published: 02 August 2017

Abstract

We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or "translate" a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well-motivated model of the document-to-query translation process, and describe an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents. As we show, one can view this approach as a generalization and justification of the "language modeling" strategy recently proposed by Ponte and Croft. In a series of experiments on TREC data, a simple translation-based retrieval system performs well in comparison to conventional retrieval techniques. This prototype system only begins to tap the full potential of translation-based retrieval.
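The scoring rule described in the abstract, ranking documents by a prior times the probability that the query is a translation of the document, can be illustrated with a minimal sketch. The abstract does not give the form of the translation model, so the code below assumes a simple word-by-word mixture: each query word is generated by picking a document word from the document's unigram distribution and translating it. The names `document_language_model`, `log_score`, `rank`, the `translation` table, the probability floor, and the toy data are all hypothetical and chosen only for illustration; a learned translation table would replace the hand-written one.

```python
import math
from collections import Counter


def document_language_model(doc_tokens):
    """Maximum-likelihood unigram distribution l(w | d) over a document's words."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_score(query_tokens, doc_tokens, translation, log_prior=0.0, floor=1e-9):
    """Return log p(d) + log p(q | d) under the assumed mixture model.

    translation[w][q] is a hypothetical table holding t(q | w), the
    probability that document word w is "translated" into query word q.
    """
    lm = document_language_model(doc_tokens)
    score = log_prior
    for q in query_tokens:
        # p(q | d) = sum over document words w of t(q | w) * l(w | d)
        p = sum(translation.get(w, {}).get(q, 0.0) * p_w for w, p_w in lm.items())
        # Assumption: a small floor stands in for proper smoothing.
        score += math.log(max(p, floor))
    return score


def rank(query_tokens, docs, translation, log_priors=None):
    """Sort documents by log prior plus log translation probability, best first."""
    log_priors = log_priors or {}
    scored = [
        (doc_id, log_score(query_tokens, tokens, translation, log_priors.get(doc_id, 0.0)))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
translation = {
    "car": {"car": 0.7, "automobile": 0.3},
    "engine": {"engine": 0.8, "motor": 0.2},
    "banana": {"banana": 1.0},
}
docs = {
    "d1": ["car", "engine"],
    "d2": ["banana", "banana"],
}

ranking = rank(["automobile"], docs, translation)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In this sketch the query word "automobile" never appears in either document, yet `d1` is ranked first because "car" can translate into it. If the table maps every word only to itself with probability 1, the score reduces to a plain unigram query likelihood, which is one way to read the abstract's remark that the approach generalizes the language modeling strategy.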

References

[1] A. Bookstein and D. Swanson (1974). "Probabilistic models for automatic indexing," Journal of the American Society for Information Science, 25, pp. 312--318.
[2] A. Broder and M. Henzinger (1998). "Information retrieval on the web: Tools and algorithmic issues," Invited tutorial at Foundations of Computer Science (FOCS).
[3] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin (1990). "A statistical approach to machine translation," Computational Linguistics, 16(2), pp. 79--85.
[4] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer (1993). "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, 19(2), pp. 263--311.
[5] P. Brown, S. Della Pietra, V. Della Pietra, M. Goldsmith, J. Hajic, R. Mercer, and S. Mohanty (1993). "But dictionaries are data too," In Proceedings of the ARPA Human Language Technology Workshop, Plainsborough, New Jersey.
[6] W. B. Croft and D. J. Harper (1979). "Using probabilistic models of document retrieval without relevance information," Journal of Documentation, 35, pp. 285--295.
[7] A. Dempster, N. Laird, and D. Rubin (1977). "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, 39(B), pp. 1--38.
[8] W. Gale and K. Church (1991). "Identifying word correspondences in parallel texts," In Fourth DARPA Workshop on Speech and Natural Language, Morgan Kaufmann Publishers, pp. 152--157.
[9] J. Ponte (1998). A language modeling approach to information retrieval. Ph.D. thesis, University of Massachusetts at Amherst.
[10] J. Ponte and W. B. Croft (1998). "A language modeling approach to information retrieval," Proceedings of the ACM SIGIR, pp. 275--281.
[11] S. E. Robertson and K. Sparck Jones (1976). "Relevance weighting of search terms," Journal of the American Society for Information Science, 27, pp. 129--146.
[12] S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau (1992). "Okapi at TREC," In Proceedings of the first Text REtrieval Conference (TREC-1), Gaithersburg, Maryland.
[13] G. Salton and C. Buckley (1988). "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 24, pp. 513--523.
[14] H. Turtle and W. B. Croft (1991). "Efficient probabilistic inference for text retrieval," Proceedings of RIAO 3
[15] W. Weaver (1955). "Translation (1949)," In Machine Translation of Languages, MIT Press.

Published In

ACM SIGIR Forum, Volume 51, Issue 2
SIGIR Test-of-Time Awardees 1978-2001
July 2017
276 pages
ISSN: 0163-5840
DOI: 10.1145/3130348
Editors: Donna Harman, Diane Kelly

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States
