Evaluating Retrieval Models through Histogram Analysis

ABSTRACT
We present a novel approach for efficiently evaluating the performance of retrieval models and introduce two evaluation metrics: Distributional Overlap (DO), which compares the clustering of scores of relevant and non-relevant documents, and Histogram Slope Analysis (HSA), which examines the log of the empirical score distributions of relevant and non-relevant documents. Unlike ranking evaluation metrics such as mean average precision (MAP) and normalized discounted cumulative gain (NDCG), DO and HSA require only computing model scores for queries and a fixed sample of relevant and non-relevant documents, rather than scoring the entire collection, even implicitly by means of an inverted index. In experimental meta-evaluations, we find that HSA achieves high correlation with MAP and NDCG on a monolingual and a cross-language document similarity task; on four ad-hoc web retrieval tasks; and on an analysis of ten TREC tasks from the past ten years. In addition, when evaluating latent Dirichlet allocation (LDA) models on document similarity tasks, HSA achieves better correlation with MAP and NDCG than perplexity, an intrinsic metric widely used with topic models.
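The abstract does not spell out how DO or HSA are computed. As a rough illustration of the kind of computation HSA implies, the Python sketch below scores only fixed samples of relevant and non-relevant documents and compares the slopes of least-squares lines fit to the logs of their score histograms. The function names, the binning, and the use of a slope difference are assumptions made for illustration, not the authors' exact formulation.

    import numpy as np

    def log_histogram_slope(scores, bins=20):
        # Histogram the scores, then fit a line to log(counts) over the bin centers.
        counts, edges = np.histogram(scores, bins=bins)
        centers = (edges[:-1] + edges[1:]) / 2.0
        mask = counts > 0  # log is undefined for empty bins
        slope, _intercept = np.polyfit(centers[mask], np.log(counts[mask]), deg=1)
        return slope

    def hsa_statistic(model_score, queries, rel_docs, nonrel_docs):
        # Score only the fixed document samples (never the whole collection),
        # then compare log-histogram slopes of relevant vs. non-relevant scores.
        # model_score(q, d) can be any retrieval model's scoring function.
        rel = np.array([model_score(q, d) for q, ds in zip(queries, rel_docs) for d in ds])
        non = np.array([model_score(q, d) for q, ds in zip(queries, nonrel_docs) for d in ds])
        return log_histogram_slope(rel) - log_histogram_slope(non)

Under this sketch, the statistic captures how differently the relevant and non-relevant score distributions fall off, which is the kind of separation the abstract says correlates with MAP and NDCG; the actual HSA definition in the paper may differ.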