
TopPRF: A Probabilistic Framework for Integrating Topic Space into Pseudo Relevance Feedback

Published: 29 August 2016

Abstract

Traditional pseudo relevance feedback (PRF) models select the top k feedback documents for query expansion and treat those documents equally. Once k is fixed, feedback terms are selected without considering how reliably each of these documents is relevant. Because the performance of PRF is sensitive to the selection of feedback terms, noisy terms imported from irrelevant or partially relevant documents can substantially harm the final results. Intuitively, terms in such documents should be considered less important for feedback term selection. Nonetheless, measuring the reliability of feedback documents is a difficult problem.
Recently, topic modeling has become increasingly popular in the information retrieval (IR) area. To identify how reliably a feedback document is relevant, we attempt to incorporate topical information into PRF. However, topics are hard to quantify, and therefore the identification of topics is usually fuzzy. Integrating the obtained topical information effectively into IR and other text-processing-related areas is very challenging. Current research mainly focuses on mining relevant information from particular topics, which is extremely difficult when the boundaries between topics are hard to define. In this article, we investigate a key factor of this problem, the number of topics used for topic modeling, and how it makes topics “fuzzy.” To apply topical information effectively and efficiently, we propose a new probabilistic framework, “TopPRF,” and three models, TS-COS, TS-EU, and TS-Entropy, that integrate “Topic Space” (TS) information into pseudo relevance feedback. These methods estimate how reliably a document is relevant through both term and topical information. When selecting feedback terms, candidate terms in more reliable feedback documents receive extra weight. Experimental results on various public collections show that our proposed methods can significantly reduce the influence of “fuzzy topics” and obtain stable, good results over strong baseline models. Our proposed probabilistic framework, TopPRF, and the three topic-space-based models can retrieve documents beyond traditional term matching alone and provide a promising avenue for constructing better topic-space-based IR systems. Moreover, in-depth discussions and conclusions are provided to help other researchers apply topical information effectively.
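
The general idea of weighting feedback terms by document reliability in topic space can be illustrated with a minimal sketch. The abstract does not give the formulas behind TS-COS, TS-EU, or TS-Entropy, so everything below is an assumption made for illustration only: reliability is taken to be the cosine similarity between a document's topic distribution and the centroid of all feedback documents' topic distributions, and the function names (`reliability_weights`, `weighted_feedback_terms`) are hypothetical. The topic distributions are assumed to come from some topic model run beforehand.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reliability_weights(doc_topics):
    """Assumed reliability score: similarity of each feedback document's
    topic distribution to the centroid of all feedback documents.
    This is an illustrative choice, not the paper's definition."""
    n_topics = len(doc_topics[0])
    centroid = [
        sum(doc[i] for doc in doc_topics) / len(doc_topics)
        for i in range(n_topics)
    ]
    return [cosine(doc, centroid) for doc in doc_topics]


def weighted_feedback_terms(doc_terms, doc_topics, n_terms=10):
    """Score candidate terms by term frequency scaled by the reliability
    of the document they occur in, then return the top n_terms."""
    weights = reliability_weights(doc_topics)
    scores = Counter()
    for terms, weight in zip(doc_terms, weights):
        for term, tf in Counter(terms).items():
            scores[term] += weight * tf
    return [term for term, _ in scores.most_common(n_terms)]


# Toy data: two documents share a dominant topic, the third is off-topic.
doc_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
doc_terms = [
    ["retrieval", "feedback"],
    ["retrieval", "topic"],
    ["noise", "noise"],
]
weights = reliability_weights(doc_topics)
top_terms = weighted_feedback_terms(doc_terms, doc_topics, n_terms=1)
```

In this toy example the off-topic document receives the lowest weight, so its repeated term no longer ties with a term shared by the two on-topic documents, as it would under raw term frequency. Replacing `cosine` with a distance-based or entropy-based score would give analogous variants, again only as a sketch.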

Supplementary Material

a22-miao-apndx.pdf (miao.zip)
Supplemental movie, appendix, image, and software files for TopPRF: A Probabilistic Framework for Integrating Topic Space into Pseudo Relevance Feedback


    Published In

    ACM Transactions on Information Systems  Volume 34, Issue 4
    September 2016
    217 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2954381

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 August 2016
    Accepted: 01 June 2016
    Revised: 01 April 2016
    Received: 01 November 2015
    Published in TOIS Volume 34, Issue 4

    Author Tags

    1. Pseudo relevance feedback
    2. text mining
    3. topic modeling

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • IBM Shared University (SUR) Award
    • Discovery grant and CREATE award from the Natural Sciences & Engineering Research Council (NSERC) of Canada
    • Early Researcher Award/Premier's Research Excellence Award
    • Information Retrieval and Knowledge Management Research Laboratory
