ABSTRACT
Offline evaluation for information retrieval compares the performance of retrieval systems using relevance judgments for a set of test queries. Since manual judgments are expensive, selective labeling has been developed to label documents semi-automatically by exploiting the similarity relationships among retrieved documents. Intuitively, the degree of agreement with the cluster hypothesis directly determines how many manual judgments can be saved by creating labels with a semi-automatic method. Meanwhile, any document representation loses some information. We argue that a better document representation leads to better agreement with the cluster hypothesis. To this end, we investigate different document representations on established benchmarks in the context of low-cost evaluation, showing that they vary in how well they capture document similarity relative to a query.
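As a rough illustration of what "agreement with the cluster hypothesis" can mean for a given representation, the sketch below computes a simple nearest-neighbor score: for each relevant document, the fraction of its k most similar documents that are also relevant. This is a minimal sketch under stated assumptions, not the measure used in the paper. The function names, the cosine similarity, and the toy vectors are all hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_agreement(vectors, relevant, k=1):
    """Average, over relevant documents, of the fraction of relevant
    documents among each one's k nearest neighbors.

    vectors  -- one vector per retrieved document (assumed representation)
    relevant -- parallel list of booleans (relevance judgments for one query)
    """
    scores = []
    for i, is_rel in enumerate(relevant):
        if not is_rel:
            continue
        # Rank all other documents by similarity to document i.
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
        top = others[:k]
        if top:
            scores.append(sum(relevant[j] for j in top) / len(top))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: the same four documents under two hypothetical representations.
relevant = [True, True, False, False]
rep_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # relevant docs close together
rep_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]  # relevant docs far apart

score_a = nearest_neighbor_agreement(rep_a, relevant, k=1)
score_b = nearest_neighbor_agreement(rep_b, relevant, k=1)
print(score_a, score_b)
```

Under a representation with a higher score, a label propagated from a judged document to its neighbors is more likely to be correct, which is the intuition behind saving manual judgments.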
Index Terms
- Cluster Hypothesis in Low-Cost IR Evaluation with Different Document Representations