skip to main content
10.1145/2911451.2914688acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
short-paper
Public Access

Improving Language Estimation with the Paragraph Vector Model for Ad-hoc Retrieval

Authors Info & Claims
Published:07 July 2016Publication History

ABSTRACT

Incorporating topic level estimation into language models has been shown to be beneficial for information retrieval (IR) models such as cluster-based retrieval and LDA-based document representation. Neural embedding models, such as paragraph vector (PV) models, on the other hand have shown their effectiveness and efficiency in learning semantic representations of documents and words in multiple Natural Language Processing (NLP) tasks. However, their effectiveness in information retrieval is mostly unknown. In this paper, we study how to effectively use the PV model to improve ad-hoc retrieval. We propose three major improvements over the original PV model to adapt it for the IR scenario: (1) we use a document frequency-based rather than the corpus frequency-based negative sampling strategy so that the importance of frequent words will not be suppressed excessively; (2) we introduce regularization over the document representation to prevent the model overfitting short documents along with the learning iterations; and (3) we employ a joint learning objective which considers both the document-word and word-context associations to produce better word probability estimation. By incorporating this enhanced PV model into the language modeling framework, we show that it can significantly outperform the state-of-the-art topic enhanced language models.

References

  1. A. M. Dai, C. Olah, Q. V. Le, and G. S. Corrado. Document embedding with paragraph vectors. In NIPS Deep Learning Workshop, 2014.Google ScholarGoogle Scholar
  2. D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word embedding based generalized language model for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 795--798. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Huston and W. B. Croft. A comparison of retrieval models using term dependencies. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 111--120. ACM, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188--1196, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2177--2185. Curran Associates, Inc., 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186--193. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. T. Mikolov, I. Sutskever, K. Chen, G. S. CJorrado, and M. I. Dean, Jeffdan. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111--3119, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275--281. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Robertson. Understanding inverse document frequency: on theoretical arguments for idf. Journal of Documentation, 60(5):503--520, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  10. F. Sun, J. Guo, Y. Lan, J. Xu, and X. Cheng. Learning word representations by jointly modeling syntagmatic and paradigmatic relations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 136--145, Beijing, China, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  11. I. Vulić and M.-F. Moens. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 363--372. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 178--185, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Zhai. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1):1--141, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 334--342. ACM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Improving Language Estimation with the Paragraph Vector Model for Ad-hoc Retrieval

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
        July 2016
        1296 pages
        ISBN:9781450340694
        DOI:10.1145/2911451

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 July 2016

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • short-paper

        Acceptance Rates

        SIGIR '16 Paper Acceptance Rate62of341submissions,18%Overall Acceptance Rate792of3,983submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader