ABSTRACT
Evaluation is crucial in assessing the effectiveness of new information retrieval and human-computer interaction techniques and systems. Relevance judgments are often performed by humans, which makes obtaining them expensive and time-consuming. Consequently, relevance judgments are usually performed only on a subset of a given collection of data or experimental results, with a focus on the top-ranked documents. However, when assessing the performance of exploratory search systems, the diversity or subjective relevance of the documents presented to the user over a search session is often more important than the relative ranking of the top documents. To perform these types of assessment, all the documents in a given collection need to be judged for relevance. In this paper, we propose an approach based on topic modeling that can greatly accelerate relevance judgment of an entire document collection, with an expert assessor needing to mark only a small subset of documents. Experimental results show substantial overlap between the relevance judgments produced by our approach and those of a human assessor.
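The abstract does not spell out the propagation scheme, but the core idea can be sketched as follows. This is a minimal illustration under assumed details (LDA topic distributions, cosine similarity, a fixed threshold; the function name and parameters are hypothetical), not the paper's actual method: fit a topic model over the collection, have the expert judge a small seed set, and label each remaining document relevant if its topic distribution is sufficiently close to that of a seed document judged relevant.

```python
# Hedged sketch: propagating a few expert relevance judgments to a whole
# collection via topic-model similarity. The similarity measure, threshold,
# and seeding strategy here are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def propagate_judgments(docs, seed_labels, n_topics=2, threshold=0.9):
    """docs: list of document strings.
    seed_labels: {doc_index: True/False}, the expert's judgments.
    Returns one inferred relevance label (bool) per document."""
    # Fit LDA and get each document's topic distribution.
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(counts)

    # Similarity of every document to the seed documents judged relevant.
    relevant_idx = [i for i, rel in seed_labels.items() if rel]
    sims = cosine_similarity(theta, theta[relevant_idx])

    labels = []
    for i in range(len(docs)):
        if i in seed_labels:
            labels.append(seed_labels[i])  # keep the expert's own judgment
        else:
            # Relevant if close in topic space to any relevant seed.
            labels.append(bool(sims[i].max() >= threshold))
    return labels
```

With a handful of judged seeds, the remaining documents inherit labels in a single pass, which is the kind of acceleration the abstract describes; a real system would tune the number of topics and the threshold against held-out judgments.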