Abstract
The evaluation of Information Retrieval (IR) systems has recently been exploring the use of preference judgments over two lists of search results, presented side-by-side to judges. Such preference judgments have been shown to capture a richer set of relevance criteria than traditional methods of collecting relevance labels per single document. However, preference judgments over lists are expensive to obtain and are less reusable as any change to either side necessitates a new judgment. In this paper, we propose a way to measure the dissimilarity between two sides in side-by-side evaluation experiments and show how this measure can be used to prioritize queries to be judged in an offline setting. Our proposed measure, referred to as Weighted Ranking Difference (WRD), takes into account both the ranking differences and the similarity of the documents across the two sides, where a document may, for example, be a URL or a query suggestion. We empirically evaluate our measure on a large-scale, real-world dataset of crowdsourced preference judgments over ranked lists of auto-completion suggestions. We show that the WRD score is indicative of the probability of tie preference judgments and can, on average, save 25% of the judging resources.
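The abstract does not give the exact formulation of WRD, so the following is a minimal illustrative sketch only: a dissimilarity score over two ranked lists that blends rank displacement with content dissimilarity, assuming a reciprocal-rank discount and a pairwise similarity function sim(a, b) in [0, 1]. The function name, the discount scheme, and the blending rule are our assumptions, not the authors' definition.

```python
# Hypothetical sketch of a weighted-ranking-difference style measure.
# NOT the authors' WRD formula: the reciprocal-rank discount and the
# use of a document-similarity function `sim` are illustrative choices.

def weighted_ranking_difference(left, right, sim):
    """Score the dissimilarity of two ranked lists in [0, 1].

    `sim(a, b)` returns a similarity in [0, 1] between two documents
    (e.g., URLs or query suggestions). Returns 0.0 for identical lists
    and approaches 1.0 for disjoint lists of dissimilar documents.
    """
    def best_match(doc, other):
        # Most similar counterpart on the other side, with its rank.
        scored = [(sim(doc, d), rank) for rank, d in enumerate(other)]
        return max(scored) if scored else (0.0, 0)

    n = max(len(left), len(right))
    total, norm = 0.0, 0.0
    for rank, doc in enumerate(left):
        weight = 1.0 / (rank + 1)  # top ranks count more (assumed discount)
        s, other_rank = best_match(doc, right)
        rank_diff = abs(rank - other_rank) / max(n - 1, 1)
        # Similar documents contribute their rank displacement;
        # dissimilar ones contribute their content mismatch.
        total += weight * ((1.0 - s) + s * rank_diff)
        norm += weight
    return total / norm if norm else 0.0


# Example with exact-match similarity: identical lists score 0.0.
if __name__ == "__main__":
    exact = lambda a, b: 1.0 if a == b else 0.0
    print(weighted_ranking_difference(["a", "b", "c"], ["a", "b", "c"], exact))
    print(weighted_ranking_difference(["a", "b", "c"], ["c", "b", "a"], exact))
```

Under the paper's stated use, a low score of this kind would flag query pairs whose sides are near-duplicates, and hence likely ties that can be deprioritized for judging; the specific thresholding is not described in the abstract.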
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kazai, G., Sung, H. (2014). Dissimilarity Based Query Selection for Efficient Preference Based IR Evaluation. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_15
DOI: https://doi.org/10.1007/978-3-319-06028-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science (R0)