A Short Survey on Online and Offline Methods for Search Quality Evaluation

Part of the book series: Communications in Computer and Information Science (CCIS, volume 573)

Abstract

Evaluation has always been the cornerstone of scientific development. Scientists come up with hypotheses (models) to explain physical phenomena, and validate these models by comparing their output to observations in nature. A scientific field then consists merely of a collection of hypotheses that have not (yet) been disproved when compared against nature. Evaluation plays exactly the same key role in the field of information retrieval: researchers and practitioners develop models to explain the relation between an information need expressed by a person and the information contained in available resources, and test these models by comparing their outcomes to collections of observations.

This article is a short survey of the methods, measures, and designs used in the field of Information Retrieval to evaluate the quality of search algorithms (i.e., the implementations of a model) against collections of observations. The phrase “search quality” has more than one interpretation; here, however, I will only discuss one of them: the effectiveness of a search algorithm in finding the information requested by a user. Two types of collections of observations are used for the purpose of evaluation: (a) relevance annotations, and (b) observable user behaviour. I will call the evaluation framework based on the former collection-based evaluation, and the one based on the latter in-situ evaluation.

This survey is far from complete; it only presents my personal viewpoint on the recent developments in the field.
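
To make the distinction drawn above concrete, the sketch below (my own illustration, not part of the chapter) contrasts the two frameworks: a collection-based evaluation scores a ranked list against relevance annotations (precision at k and average precision here), while an in-situ evaluation derives a signal from observed user behaviour (a simple click-through rate here). All names and data (qrels, ranking, click_log) are hypothetical toy examples.

from typing import Dict, List, Set

def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Collection-based: fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Collection-based: mean of the precision values at the ranks of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def click_through_rate(click_log: List[Dict]) -> float:
    """In-situ: fraction of impressions on which the user clicked at least one result."""
    if not click_log:
        return 0.0
    return sum(1 for imp in click_log if imp["clicks"]) / len(click_log)

# Hypothetical toy data for illustration only.
qrels = {"q1": {"d2", "d5"}}                      # relevance annotations for query q1
ranking = ["d1", "d2", "d3", "d4", "d5"]          # system output for q1
click_log = [{"query": "q1", "clicks": ["d2"]},   # observed user behaviour
             {"query": "q1", "clicks": []}]

print(precision_at_k(ranking, qrels["q1"], k=3))  # 1 relevant in top 3 -> 0.333...
print(average_precision(ranking, qrels["q1"]))    # (1/2 + 2/5) / 2 = 0.45
print(click_through_rate(click_log))              # 1 of 2 impressions clicked -> 0.5

Both frameworks try to answer the same question, namely which system better serves the user's information need, but they do so from different kinds of evidence.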

Notes

  1. Retrieval systems and search engines are used interchangeably in this paper.

  2. Text REtrieval Conference.

  3. See the TREC Crowdsourcing track: https://sites.google.com/site/treccrowd/.

  4. http://ir.cis.udel.edu/sessions/.

  5. A tutorial on the topic has also been given by Carterette [27, 28].

  6. Also known as split testing, control/treatment testing, bucket testing, randomised experiments, and online field experiments.

  7. Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, Yahoo!, and Zynga have reported performing A/B tests.

  8. “If you torture the data enough, it will confess to anything”, Ronald Harry Coase.

References

  1. Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In: WSDM, pp. 5–14 (2009)

  2. Al-Harbi, A.L., Smucker, M.D.: A qualitative exploration of secondary assessor relevance judging behavior. In: Proceedings of the 5th Information Interaction in Context Symposium, IIiX 2014, pp. 195–204. ACM, New York (2014). http://doi.acm.org/10.1145/2637002.2637025

  3. Allan, J., Carterette, B., Dachev, B., Aslam, J.A., Pavlu, V., Kanoulas, E.: Million query track 2007 overview. In: Proceedings of the Sixteenth Text REtrieval Conference, TREC 2007, Gaithersburg, Maryland, USA, 5–9 November 2007. http://trec.nist.gov/pubs/trec16/papers/1MQ.OVERVIEW16.pdf

  4. Alonso, O., Baeza-Yates, R.: Design and implementation of relevance assessments using crowdsourcing. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 153–164. Springer, Heidelberg (2011). http://dx.doi.org/10.1007/978-3-642-20161-5_16

  5. Alonso, O., Mizzaro, S.: Using crowdsourcing for TREC relevance assessment. Inf. Process. Manage. 48(6), 1053–1066 (2012). http://dx.doi.org/10.1016/j.ipm.2012.01.004

  6. Amigó, E., Gonzalo, J., Verdejo, F.: A general evaluation measure for document organization tasks. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013, pp. 643–652. ACM, New York (2013). http://doi.acm.org/10.1145/2484028.2484081

  7. Ashkan, A., Clarke, C.L.: On the informativeness of cascade and intent-aware effectiveness measures. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 407–416. ACM, New York (2011). http://doi.acm.org/10.1145/1963405.1963464

  8. Aslam, J.A., Pavlu, V., Savell, R.: A unified model for metasearch, pooling, and system evaluation. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM 2003, pp. 484–491. ACM, New York (2003). http://doi.acm.org/10.1145/956863.956953

  9. Aslam, J.A., Pavlu, V., Yilmaz, E.: A statistical method for system evaluation using incomplete judgments. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, pp. 541–548, 6–11 August 2006. http://doi.acm.org/10.1145/1148170.1148263

  10. Aslam, J.A., Savell, R.: On the effectiveness of evaluating retrieval systems in the absence of relevance judgments. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR 2003, pp. 361–362. ACM, New York (2003). http://doi.acm.org/10.1145/860435.860501

  11. Aslam, J.A., Yilmaz, E.: Inferring document relevance via average precision. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 601–602. ACM, New York (2006). http://doi.acm.org/10.1145/1148170.1148275

  12. Aslam, J.A., Yilmaz, E.: Inferring document relevance from incomplete information. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, pp. 633–642, 6–10 November 2007. http://doi.acm.org/10.1145/1321440.1321529

  13. Aslam, J.A., Yilmaz, E., Pavlu, V.: The maximum entropy method for analyzing retrieval measures. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2005, pp. 27–34. ACM, New York (2005). http://doi.acm.org/10.1145/1076034.1076042

  14. Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A.P., Yilmaz, E.: Relevance assessment: are judges exchangeable and does it matter. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, pp. 667–674, 20–24 July 2008. http://doi.acm.org/10.1145/1390334.1390447

  15. Bakshy, E., Eckles, D.: Uncertainty in online experiments with dependent data: an evaluation of bootstrap methods. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 1303–1311. ACM, New York (2013). http://doi.acm.org/10.1145/2487575.2488218

  16. Bakshy, E., Eckles, D., Bernstein, M.S.: Designing and deploying online field experiments. In: Proceedings of the 23rd International Conference on World Wide Web, WWW 2014, pp. 283–292. ACM, New York (2014). http://doi.acm.org/10.1145/2566486.2567967

  17. Baskaya, F., Keskustalo, H., Järvelin, K.: Simulating simple and fallible relevance feedback. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 593–604. Springer, Heidelberg (2011). http://dl.acm.org/citation.cfm?id=1996889.1996965

  18. Baskaya, F., Keskustalo, H., Järvelin, K.: Time drives interaction: simulating sessions in diverse searching environments. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 105–114. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348301

  19. Belkin, N.J.: Salton award lecture: people, interacting with information. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, pp. 1–2, 9–13 August 2015. http://doi.acm.org/10.1145/2766462.2767854

  20. Berto, A., Mizzaro, S., Robertson, S.: On using fewer topics in information retrieval evaluations. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, ICTIR 2013, pp. 9:30–9:37. ACM, New York (2013). http://doi.acm.org/10.1145/2499178.2499184

  21. Bilgic, M., Bennett, P.N.: Active query selection for learning rankers. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 1033–1034. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348455

  22. Blanco, R., Halpin, H., Herzig, D.M., Mika, P., Pound, J., Thompson, H.S., Tran Duc, T.: Repeatable and reliable search system evaluation using crowdsourcing. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 923–932. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2010039

  23. Busin, L., Mizzaro, S.: Axiometrics: an axiomatic approach to information retrieval effectiveness metrics. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, ICTIR 2013, pp. 8:22–8:29. ACM, New York (2013). http://doi.acm.org/10.1145/2499178.2499182

  24. Carterette, B.: Robust test collections for retrieval evaluation. In: SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp. 55–62, 23–27 July 2007. http://doi.acm.org/10.1145/1277741.1277754

  25. Carterette, B.: System effectiveness, user models, and user utility: a conceptual framework for investigation. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 903–912. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2010037

  26. Carterette, B.: Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Trans. Inf. Syst. 30(1), 4:1–4:34 (2012). http://doi.acm.org/10.1145/2094072.2094076

  27. Carterette, B.: Statistical significance testing in information retrieval: theory and practice. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, ICTIR 2013, p. 2:2. ACM, New York (2013). http://doi.acm.org/10.1145/2499178.2499204

  28. Carterette, B.: Statistical significance testing in information retrieval: theory and practice. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014, p. 1286. ACM, New York (2014). http://doi.acm.org/10.1145/2600428.2602292

  29. Carterette, B., Allan, J., Sitaraman, R.K.: Minimal test collections for retrieval evaluation. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, pp. 268–275, 6–11 August 2006. http://doi.acm.org/10.1145/1148170.1148219

  30. Carterette, B., Bah, A., Zengin, M.: Dynamic test collections for retrieval evaluation. In: Proceedings of the 2015 International Conference on the Theory of Information Retrieval, ICTIR 2015, pp. 91–100. ACM, New York (2015). http://doi.acm.org/10.1145/2808194.2809470

  31. Carterette, B., Kanoulas, E., Pavlu, V., Fang, H.: Reusable test collections through experimental design. In: Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, pp. 547–554, 19–23 July 2010. http://doi.acm.org/10.1145/1835449.1835541

  32. Carterette, B., Kanoulas, E., Yilmaz, E.: Simulating simple user behavior for system effectiveness evaluation. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM 2011, pp. 611–620. ACM, New York (2011). http://doi.acm.org/10.1145/2063576.2063668

  33. Carterette, B., Kanoulas, E., Yilmaz, E.: Incorporating variability in user behavior into systems based evaluation. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 135–144. ACM, New York (2012). http://doi.acm.org/10.1145/2396761.2396782

  34. Carterette, B., Pavlu, V., Fang, H., Kanoulas, E.: Million query track 2009 overview. In: Proceedings of The Eighteenth Text REtrieval Conference, TREC 2009, Gaithersburg, Maryland, USA, 17–20 November 2009. http://trec.nist.gov/pubs/trec18/papers/MQ09OVERVIEW.pdf

  35. Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J.A., Allan, J.: Evaluation over thousands of queries. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, pp. 651–658, 20–24 July 2008. http://doi.acm.org/10.1145/1390334.1390445

  36. Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J.A., Allan, J.: If I had a million queries. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 288–300. Springer, Heidelberg (2009). http://dx.doi.org/10.1007/978-3-642-00958-7_27

  37. Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 539–546. ACM, New York (2010). http://doi.acm.org/10.1145/1835449.1835540

  38. Chakraborty, S., Radlinski, F., Shokouhi, M., Baecke, P.: On correlation of absence time and search effectiveness. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014, pp. 1163–1166. ACM, New York (2014). http://doi.acm.org/10.1145/2600428.2609535

  39. Chandar, P., Webber, W., Carterette, B.: Document features predicting assessor disagreement. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013, pp. 745–748. ACM, New York (2013). http://doi.acm.org/10.1145/2484028.2484161

  40. Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S.L.: Intent-based diversification of web search results: metrics and algorithms. Inf. Retr. 14(6), 572–592 (2011)

  41. Chapelle, O., Joachims, T., Radlinski, F., Yue, Y.: Large-scale validation and analysis of interleaved search evaluation. ACM Trans. Inf. Syst. 30(1), 6:1–6:41 (2012). http://doi.acm.org/10.1145/2094072.2094078

  42. Chapelle, O., Metlzer, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for graded relevance. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, pp. 621–630. ACM, New York (2009). http://doi.acm.org/10.1145/1645953.1646033

  43. Chuklin, A., Markov, I., de Rijke, M.: Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, San Rafael (2015). http://dx.doi.org/10.2200/S00654ED1V01Y201507ICR043

  44. Chuklin, A., Markov, I., de Rijke, M.: Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, San Rafael (2015). http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf

  45. Chuklin, A., Serdyukov, P., de Rijke, M.: Click model-based information retrieval metrics. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013, pp. 493–502. ACM, New York (2013). http://doi.acm.org/10.1145/2484028.2484071

  46. Chuklin, A., Zhou, K., Schuth, A., Sietsma, F., de Rijke, M.: Evaluating intuitiveness of vertical-aware click models. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2014, pp. 1075–1078. ACM, New York (2014). http://doi.acm.org/10.1145/2600428.2609513

  47. Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In: SIGIR 2008: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 659–666. ACM, New York (2008)

  48. Cormack, G.V., Palmer, C.R., Clarke, C.L.A.: Efficient construction of large test collections. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 282–289. ACM, New York (1998). http://doi.acm.org/10.1145/290941.291009

  49. Craswell, N., Szummer, M.: Random walks on the click graph. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, pp. 239–246. ACM, New York (2007). http://doi.acm.org/10.1145/1277741.1277784

  50. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An experimental comparison of click position-bias models. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM 2008, pp. 87–94. ACM, New York (2008). http://doi.acm.org/10.1145/1341531.1341545

  51. Crook, T., Frasca, B., Kohavi, R., Longbotham, R.: Seven pitfalls to avoid when running controlled experiments on the web. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2009, pp. 1105–1114. ACM, New York (2009). http://doi.acm.org/10.1145/1557019.1557139

  52. Dang, V., Xue, X., Croft, W.B.: Inferring query aspects from reformulations using clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM 2011, pp. 2117–2120. ACM, New York (2011). http://doi.acm.org/10.1145/2063576.2063904

  53. Demartini, G., Mizzaro, S.: A classification of IR effectiveness metrics. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 488–491. Springer, Heidelberg (2006). http://dx.doi.org/10.1007/11735106_48

  54. Demeester, T., Aly, R., Hiemstra, D., Nguyen, D., Trieschnigg, D., Develder, C.: Exploiting user disagreement for web search evaluation: an experimental approach. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM 2014, pp. 33–42. ACM, New York (2014). http://doi.acm.org/10.1145/2556195.2556268

  55. Deng, A., Hu, V.: Diluted treatment effect estimation for trigger analysis in online controlled experiments. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, pp. 349–358. ACM, New York (2015). http://doi.acm.org/10.1145/2684822.2685307

  56. Deng, A., Li, T., Guo, Y.: Statistical inference in two-stage online controlled experiments with treatment selection and validation. In: Proceedings of the 23rd International Conference on World Wide Web, WWW 2014, pp. 609–618. ACM, New York (2014). http://doi.acm.org/10.1145/2566486.2568028

  57. Deng, A., Xu, Y., Kohavi, R., Walker, T.: Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, pp. 123–132. ACM, New York (2013). http://doi.acm.org/10.1145/2433396.2433413

  58. Diriye, A., White, R., Buscher, G., Dumais, S.: Leaving so soon?: understanding and predicting web search abandonment rationales. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 1025–1034. ACM, New York (2012). http://doi.acm.org/10.1145/2396761.2398399

  59. Drutsa, A., Gusev, G., Serdyukov, P.: Future user engagement prediction and its application to improve the sensitivity of online experiments. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, pp. 256–266. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva (2015). http://dx.doi.org/10.1145/2736277.2741116

  60. Dupret, G.E., Piwowarski, B.: A user browsing model to predict search engine click data from past observations. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 331–338. ACM, New York (2008). http://doi.acm.org/10.1145/1390334.1390392

  61. Efron, M.: Using multiple query aspects to build test collections without human relevance judgments. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 276–287. Springer, Heidelberg (2009). http://dx.doi.org/10.1007/978-3-642-00958-7_26

  62. Ferrante, M., Ferro, N., Maistro, M.: Injecting user models and time into precision via Markov chains. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014, pp. 597–606. ACM, New York (2014). http://doi.acm.org/10.1145/2600428.2609637

  63. Fox, S., Karnawat, K., Mydland, M., Dumais, S., White, T.: Evaluating implicit measures to improve web search. ACM Trans. Inf. Syst. 23(2), 147–168 (2005). http://doi.acm.org/10.1145/1059981.1059982

  64. Grotov, A., Chuklin, A., Markov, I., Stout, L., Xumara, F., de Rijke, M.: A comparative study of click models for web search. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G., San Juan, E., Capellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 78–90. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24027-5_7

  65. Grotov, A., Whiteson, S., de Rijke, M.: Bayesian ranker comparison based on historical user interactions. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 273–282. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767730

  66. Guiver, J., Mizzaro, S., Robertson, S.: A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Trans. Inf. Syst. 27(4), 21:1–21:26 (2009). http://doi.acm.org/10.1145/1629096.1629099

  67. Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.M., Faloutsos, C.: Click chain model in web search. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, pp. 11–20. ACM, New York (2009). http://doi.acm.org/10.1145/1526709.1526712

  68. Guo, F., Liu, C., Wang, Y.M.: Efficient multiple-click models in web search. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM 2009, pp. 124–131. ACM, New York (2009). http://doi.acm.org/10.1145/1498759.1498818

  69. Guo, Q., Agichtein, E.: Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, pp. 569–578. ACM, New York (2012). http://doi.acm.org/10.1145/2187836.2187914

  70. Guo, Y., Deng, A.: Flexible Online Repeated Measures Experiment. ArXiv e-prints, January 2015

  71. Harman, D., Voorhees, E.M.: TREC: an overview. ARIST 40(1), 113–155 (2006). http://dx.doi.org/10.1002/aris.1440400111

  72. Hassan, A., Shi, X., Craswell, N., Ramsey, B.: Beyond clicks: query reformulation as a predictor of search satisfaction. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013, pp. 2019–2028. ACM, New York (2013). http://doi.acm.org/10.1145/2505515.2505682

  73. Hauff, C., Hiemstra, D., Azzopardi, L., de Jong, F.: A case for automatic system evaluation. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 153–165. Springer, Heidelberg (2010). http://dx.doi.org/10.1007/978-3-642-12275-0_16

  74. He, J., Zhai, C., Li, X.: Evaluation of methods for relative comparison of retrieval systems based on clickthroughs. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, pp. 2029–2032. ACM, New York (2009). http://doi.acm.org/10.1145/1645953.1646293

  75. Hofmann, K., Whiteson, S., de Rijke, M.: A probabilistic method for inferring preferences from clicks. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM 2011, pp. 249–258. ACM, New York (2011). http://doi.acm.org/10.1145/2063576.2063618

  76. Hofmann, K., Whiteson, S., de Rijke, M.: Estimating interleaved comparison outcomes from historical click data. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 1779–1783. ACM, New York (2012). http://doi.acm.org/10.1145/2396761.2398516

  77. Hosseini, M., Cox, I., Milic-Frayling, N.: Optimizing the cost of information retrieval testcollections. In: Proceedings of the 4th Workshop on Workshop for Ph.D. Students in Information and Knowledge Management, PIKM 2011, pp. 79–82. ACM, New York (2011). http://doi.acm.org/10.1145/2065003.2065020

  78. Hosseini, M., Cox, I.J., Milic-Frayling, N., Shokouhi, M., Yilmaz, E.: An uncertainty-aware query selection model for evaluation of IR systems. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 901–910. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348403

  79. Hosseini, M., Cox, I.J., Milic-Frayling, N., Vinay, V., Sweeting, T.: Selecting a subset of queries for acquisition of further relevance judgements. In: Amati, G., Crestani, F. (eds.) ICTIR 2011. LNCS, vol. 6931, pp. 113–124. Springer, Heidelberg (2011)

  80. Hu, Y., Qian, Y., Li, H., Jiang, D., Pei, J., Zheng, Q.: Mining query subtopics from search log data. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 305–314. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348327

  81. Huang, J., White, R.W., Dumais, S.: No clicks, no problem: using cursor movements to understand and improve search. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2011, pp. 1225–1234. ACM, New York (2011). http://doi.acm.org/10.1145/1978942.1979125

  82. Järvelin, K., Price, S.L., Delcambre, L.M.L., Nielsen, M.L.: Discounted cumulated gain based evaluation of multiple-query IR sessions. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 4–15. Springer, Heidelberg (2008). http://dl.acm.org/citation.cfm?id=1793274.1793280

  83. Jiang, J., He, D., Han, S., Yue, Z., Ni, C.: Contextual evaluation of query reformulations in a search session by user simulation. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 2635–2638. ACM, New York (2012). http://doi.acm.org/10.1145/2396761.2398710

  84. Joachims, T.: Evaluating retrieval performance using clickthrough data. In: Franke, J., Nakhaeizadeh, G., Renz, I. (eds.) Text Mining, pp. 79–96. Physica/Springer Verlag, New York (2003)

  85. Joachims, T., Granka, L., Pan, B., Hembrooke, H., Gay, G.: Accurately interpreting clickthrough data as implicit feedback. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2005, pp. 154–161. ACM, New York (2005). http://doi.acm.org/10.1145/1076034.1076063

  86. Joachims, T., Granka, L., Pan, B., Hembrooke, H., Radlinski, F., Gay, G.: Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inf. Syst. 25(2), 1–26 (2007). http://doi.acm.org/10.1145/1229179.1229181

  87. Kanoulas, E., Aslam, J.A.: Empirical justification of the gain and discount function for NDCG. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, pp. 611–620. ACM, New York (2009). http://doi.acm.org/10.1145/1645953.1646032

  88. Kanoulas, E., Carterette, B., Clough, P.D., Sanderson, M.: Evaluating multi-query sessions. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 1053–1062. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2010056

  89. Kazai, G.: In search of quality in crowdsourcing for search engine evaluation. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 165–176. Springer, Heidelberg (2011). http://dl.acm.org/citation.cfm?id=1996889.1996911

  90. Kazai, G., Craswell, N., Yilmaz, E., Tahaghoghi, S.: An analysis of systematic judging errors in information retrieval. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 105–114. ACM, New York (2012). http://doi.acm.org/10.1145/2396761.2396779

  91. Kazai, G., Kamps, J., Milic-Frayling, N.: An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Inf. Retr. 16(2), 138–178 (2013). http://dx.doi.org/10.1007/s10791-012-9205-0

  92. Kazai, G., Yilmaz, E., Craswell, N., Tahaghoghi, S.M.M.: User intent and assessor disagreement in web search evaluation. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013, San Francisco, CA, USA, pp. 699–708, 27 October–1 November 2013. http://doi.acm.org/10.1145/2505515.2505716

  93. Kelly, D., Belkin, N.J.: Display time as implicit feedback: understanding task effects. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2004, pp. 377–384. ACM, New York (2004). http://doi.acm.org/10.1145/1008992.1009057

  94. Kharitonov, E., Vorobev, A., Macdonald, C., Serdyukov, P., Ounis, I.: Sequential testing for early stopping of online experiments. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015. ACM, New York (2015)

  95. Kharitonov, E., Macdonald, C., Serdyukov, P., Ounis, I.: Generalized team draft interleaving. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM 2015, pp. 773–782. ACM, New York (2015). http://doi.acm.org/10.1145/2806416.2806477

  96. Kim, Y., Hassan, A., White, R.W., Zitouni, I.: Modeling dwell time to predict click-level satisfaction. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM 2014, pp. 193–202. ACM, New York (2014). http://doi.acm.org/10.1145/2556195.2556220

  97. Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., Xu, Y.: Trustworthy online controlled experiments: five puzzling outcomes explained. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, pp. 786–794. ACM, New York (2012). http://doi.acm.org/10.1145/2339530.2339653

  98. Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., Pohlmann, N.: Online controlled experiments at large scale. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 1168–1176. ACM, New York (2013). http://doi.acm.org/10.1145/2487575.2488217

  99. Kohavi, R., Deng, A., Longbotham, R., Xu, Y.: Seven rules of thumb for web site experimenters. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pp. 1857–1866. ACM, New York (2014). http://doi.acm.org/10.1145/2623330.2623341

  100. Kohavi, R., Longbotham, R.: Online controlled experiments and A/B tests. In: Sammut, C., Webb, G. (eds.) Encyclopedia of Machine Learning and Data Mining (2015)

  101. Kohavi, R., Longbotham, R., Sommerfield, D., Henne, R.: Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Disc. 18(1), 140–181 (2009). http://dx.doi.org/10.1007/s10618-008-0114-1

  102. Lagun, D., Ageev, M., Guo, Q., Agichtein, E.: Discovering common motifs in cursor movement data for improving web search. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM 2014, pp. 183–192. ACM, New York (2014). http://doi.acm.org/10.1145/2556195.2556265

  103. Lease, M., Yilmaz, E.: Crowdsourcing for information retrieval. SIGIR Forum 45(2), 66–75 (2012). http://doi.acm.org/10.1145/2093346.2093356

  104. Li, L., Chen, S., Kleban, J., Gupta, A.: Counterfactual estimation and optimization of click metrics for search engines. CoRR abs/1403.1891 (2014). http://arxiv.org/abs/1403.1891

  105. Li, L., Chen, S., Kleban, J., Gupta, A.: Counterfactual estimation and optimization of click metrics in search engines: a case study. In: Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015 Companion, pp. 929–934. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva (2015). http://dx.doi.org/10.1145/2740908.2742562

  106. Li, L., Kim, J.Y., Zitouni, I.: Toward predicting the outcome of an A/B experiment for search relevance. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, pp. 37–46. ACM, New York (2015). http://doi.acm.org/10.1145/2684822.2685311

  107. Liu, Y., Chen, Y., Tang, J., Sun, J., Zhang, M., Ma, S., Zhu, X.: Different users, different opinions: predicting search satisfaction with mouse movement information. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 493–502. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767721

  108. Maddalena, E., Mizzaro, S., Scholer, F., Turpin, A.: Judging relevance using magnitude estimation. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 215–220. Springer, Heidelberg (2015). http://dx.doi.org/10.1007/978-3-319-16354-3_23

  109. Megorskaya, O., Kukushkin, V., Serdyukov, P.: On the relation between assessor’s agreement and accuracy in gamified relevance assessment. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 605–614. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767727

  110. Mehrotra, R., Yilmaz, E.: Representative & informative query selection for learning to rank using submodular functions. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 545–554. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767753

  111. Metrikov, P., Pavlu, V., Aslam, J.A.: Impact of assessor disagreement on ranking performance. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 1091–1092. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348484

  112. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27(1), 2:1–2:27 (2008). http://doi.acm.org/10.1145/1416950.1416952

  113. Nuray, R., Can, F.: Automatic ranking of information retrieval systems using data fusion. Inf. Process. Manage. 42(3), 595–614 (2006). http://dx.doi.org/10.1016/j.ipm.2005.03.023

  114. Pavlu, V., Rajput, S., Golbus, P.B., Aslam, J.A.: IR system evaluation using nugget-based test collections. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM 2012, pp. 393–402. ACM, New York (2012). http://doi.acm.org/10.1145/2124295.2124343

  115. Pearl, J.: Comment: understanding Simpson’s paradox. Am. Stat. 68(1), 8–13 (2014). http://EconPapers.repec.org/RePEc:taf:amstat:v:68:y:2014:i:1:p:8-13

  116. Qian, Y., Sakai, T., Ye, J., Zheng, Q., Li, C.: Dynamic query intent mining from a search log stream. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013, pp. 1205–1208. ACM, New York (2013). http://doi.acm.org/10.1145/2505515.2507856

  117. Radlinski, F., Craswell, N.: Comparing the sensitivity of information retrieval metrics. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 667–674. ACM, New York (2010). http://doi.acm.org/10.1145/1835449.1835560

  118. Radlinski, F., Craswell, N.: Optimized interleaving for online retrieval evaluation. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, pp. 245–254. ACM, New York (2013). http://doi.acm.org/10.1145/2433396.2433429

  119. Radlinski, F., Kurup, M., Joachims, T.: How does clickthrough data reflect retrieval quality? In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, pp. 43–52. ACM, New York (2008). http://doi.acm.org/10.1145/1458082.1458092

  120. Radlinski, F., Szummer, M., Craswell, N.: Inferring query intent from reformulations and clicks. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, pp. 1171–1172. ACM, New York (2010). http://doi.acm.org/10.1145/1772690.1772859

  121. Robertson, S.: On the contributions of topics to system evaluation. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 129–140. Springer, Heidelberg (2011). http://dl.acm.org/citation.cfm?id=1996889.1996908

  122. Robertson, S.E., Kanoulas, E.: On per-topic variance in IR evaluation. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 891–900. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348402

  123. Sakai, T.: Bootstrap-based comparisons of IR metrics for finding one relevant document. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 374–389. Springer, Heidelberg (2006). http://dx.doi.org/10.1007/11880592_29

  124. Sakai, T.: Designing test collections for comparing many systems. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, pp. 61–70. ACM, New York (2014). http://doi.acm.org/10.1145/2661829.2661893

  125. Sakai, T., Dou, Z., Clarke, C.L.: The impact of intent selection on diversified search evaluation. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013, pp. 921–924. ACM, New York (2013). http://doi.acm.org/10.1145/2484028.2484105

  126. Sakai, T., Song, R.: Evaluating diversified search results using per-intent graded relevance. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 1043–1052. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2010055

  127. Sanderson, M.: Test collection based evaluation of information retrieval systems. Found. Trends Inf. Retrieval 4(4), 247–375 (2010). http://dx.doi.org/10.1561/1500000009

  128. Sanderson, M., Paramita, M.L., Clough, P., Kanoulas, E.: Do user preferences and evaluation measures line up? In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 555–562. ACM, New York (2010). http://doi.acm.org/10.1145/1835449.1835542

  129. Schaer, P.: Better than their reputation? On the reliability of relevance assessments with students. In: Catarci, T., Forner, P., Hiemstra, D., Peñas, A., Santucci, G. (eds.) CLEF 2012. LNCS, vol. 7488, pp. 124–135. Springer, Heidelberg (2012). http://dx.doi.org/10.1007/978-3-642-33247-0_14

  130. Scholer, F., Turpin, A., Sanderson, M.: Quantifying test collection quality based on the consistency of relevance judgements. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 1063–1072. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2010057

  131. Schuth, A., Bruintjes, R.J., Büttner, F., van Doorn, J., Groenland, C., Oosterhuis, H., Tran, C.N., Veeling, B., van der Velde, J., Wechsler, R., Woudenberg, D., de Rijke, M.: Probabilistic multileave for online retrieval evaluation. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 955–958. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767838

  132. Schuth, A., Hofmann, K., Radlinski, F.: Predicting search satisfaction metrics with interleaved comparisons. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 463–472. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767695

  133. Schuth, A., Sietsma, F., Whiteson, S., Lefortier, D., de Rijke, M.: Multileaved comparisons for fast online evaluation. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, pp. 71–80. ACM, New York (2014). http://doi.acm.org/10.1145/2661829.2661952

  134. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance tests for information retrieval evaluation. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM 2007, pp. 623–632. ACM, New York (2007). http://doi.acm.org/10.1145/1321440.1321528

  135. Smucker, M.D., Allan, J., Carterette, B.: Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, pp. 630–631. ACM, New York (2009). http://doi.acm.org/10.1145/1571941.1572050

  136. Smucker, M.D., Clarke, C.L.: Time-based calibration of effectiveness measures. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 95–104. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348300

  137. Soboroff, I., Nicholas, C., Cahan, P.: Ranking retrieval systems without relevance judgments. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2001, pp. 66–73. ACM, New York (2001). http://doi.acm.org/10.1145/383952.383961

  138. Song, Y., Shi, X., Fu, X.: Evaluating and predicting user engagement change with degraded search relevance. In: Proceedings of the 22nd International Conference on World Wide Web, WWW 2013, pp. 1213–1224. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva (2013). http://dl.acm.org/citation.cfm?id=2488388.2488494

  139. Tang, D., Agarwal, A., O’Brien, D., Meyer, M.: Overlapping experiment infrastructure: more, better, faster experimentation. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2010, pp. 17–26. ACM, New York (2010). http://doi.acm.org/10.1145/1835804.1835810

  140. Turpin, A., Scholer, F., Mizzaro, S., Maddalena, E.: The benefits of magnitude estimation relevance assessments for information retrieval evaluation. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 565–574. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767760

  141. Webber, W., Chandar, P., Carterette, B.: Alternative assessor disagreement and retrieval depth. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 125–134. ACM, New York (2012). http://doi.acm.org/10.1145/2396761.2396781

  142. Wu, S., Crestani, F.: Methods for ranking information retrieval systems without relevance judgments. In: Proceedings of the 2003 ACM Symposium on Applied Computing, SAC 2003, pp. 811–816. ACM, New York (2003). http://doi.acm.org/10.1145/952532.952693

  143. Yilmaz, E., Aslam, J.A.: Estimating average precision with incomplete and imperfect judgments. In: Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, pp. 102–111, 6–11 November 2006. http://doi.acm.org/10.1145/1183614.1183633

  144. Yilmaz, E., Kanoulas, E., Aslam, J.A.: A simple and efficient sampling method for estimating AP and NDCG. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, pp. 603–610, 20–24 July 2008. http://doi.acm.org/10.1145/1390334.1390437

  145. Yilmaz, E., Kanoulas, E., Craswell, N.: Effect of intent descriptions on retrieval evaluation. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, pp. 599–608. ACM, New York (2014). http://doi.acm.org/10.1145/2661829.2661950

  146. Yilmaz, E., Kazai, G., Craswell, N., Tahaghoghi, S.M.: On judgments obtained from a commercial search engine. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 1115–1116. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348496

  147. Yilmaz, E., Shokouhi, M., Craswell, N., Robertson, S.: Expected browsing utility for web search evaluation. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp. 1561–1564. ACM, New York (2010). http://doi.acm.org/10.1145/1871437.1871672

  148. Yilmaz, E., Verma, M., Craswell, N., Radlinski, F., Bailey, P.: Relevance and effort: an analysis of document utility. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, pp. 91–100. ACM, New York (2014). http://doi.acm.org/10.1145/2661829.2661953

  149. Yue, Y., Gao, Y., Chapelle, O., Zhang, Y., Joachims, T.: Learning more powerful test statistics for click-based retrieval evaluation. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 507–514. ACM, New York (2010). http://doi.acm.org/10.1145/1835449.1835534

  150. Zhang, Y., Park, L.A., Moffat, A.: Click-based evidence for decaying weight distributions in search effectiveness metrics. Inf. Retr. 13(1), 46–69 (2010). http://dx.doi.org/10.1007/s10791-009-9099-7

  151. Zhu, J., Wang, J., Vinay, V., Cox, I.J.: Topic (query) selection for IR evaluation. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, pp. 802–803. ACM, New York (2009). http://doi.acm.org/10.1145/1571941.1572136

Acknowledgements

This work is based on a tutorial I gave at the 2015 Russian Summer School in Information Retrieval (RuSSIR 2015). I would like to thank Ben Carterette, Emine Yilmaz, Anne Schuth, Katja Hofmann, and Filip Radlinski for sharing references and material that were used in that tutorial and hence served as the basis for this survey.

Author information

Corresponding author

Correspondence to Evangelos Kanoulas.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Kanoulas, E. (2016). A Short Survey on Online and Offline Methods for Search Quality Evaluation. In: Braslavski, P., et al. Information Retrieval. RuSSIR 2015. Communications in Computer and Information Science, vol 573. Springer, Cham. https://doi.org/10.1007/978-3-319-41718-9_3

  • DOI: https://doi.org/10.1007/978-3-319-41718-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41717-2

  • Online ISBN: 978-3-319-41718-9

  • eBook Packages: Computer Science (R0)
