Abstract
Online evaluation allows the assessment of information retrieval (IR) techniques based on how real users respond to them. Because this technique is directly based on observed user behavior, it is a promising alternative to traditional offline evaluation, which is based on manual relevance assessments. In particular, online evaluation can enable comparisons in settings where reliable assessments are difficult to obtain (e.g., personalized search) or expensive (e.g., for search by trained experts in specialized collections).
Despite its advantages, and its successful use in commercial settings, online evaluation is rarely employed outside of large commercial search engines due to a perception that it is impractical at small scales. The goal of this tutorial is to show how online evaluations can be conducted in such settings, demonstrate software to facilitate its use, and promote further research in the area. We will also contrast online evaluation with standard offline evaluation, and provide an overview of online approaches.
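To make the kind of online comparison described above concrete, the following is a minimal sketch of team-draft interleaving, one well-known online evaluation method in this area. The function names, toy document IDs, and click-credit rule are illustrative assumptions for this sketch, not the tutorial's own software or data.

```python
# Minimal sketch of team-draft interleaving: merge the rankings of two systems
# into one result list, remember which system contributed each document, and
# infer a per-impression preference from the user's clicks.
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Merge two rankings into a single interleaved list, recording which
    system ("team") contributed each document."""
    interleaved, team_a, team_b = [], set(), set()
    while len(interleaved) < len(set(ranking_a) | set(ranking_b)):
        # The team with fewer contributions picks next; a coin flip breaks ties.
        pick_a = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        for ranking, team in ([(ranking_a, team_a), (ranking_b, team_b)]
                              if pick_a else
                              [(ranking_b, team_b), (ranking_a, team_a)]):
            # Take the picking team's highest-ranked document not yet shown.
            doc = next((d for d in ranking if d not in interleaved), None)
            if doc is not None:
                interleaved.append(doc)
                team.add(doc)
                break
    return interleaved, team_a, team_b

def credit(clicked_docs, team_a, team_b):
    """The system whose contributed documents received more clicks wins
    this impression; equal click counts yield a tie."""
    a = sum(1 for d in clicked_docs if d in team_a)
    b = sum(1 for d in clicked_docs if d in team_b)
    return "A" if a > b else "B" if b > a else "tie"

if __name__ == "__main__":
    ranking_a = ["d1", "d2", "d3", "d4"]   # toy output of system A
    ranking_b = ["d3", "d1", "d5", "d2"]   # toy output of system B
    shown, team_a, team_b = team_draft_interleave(ranking_a, ranking_b)
    print(shown)                            # interleaved list shown to the user
    print(credit(["d3"], team_a, team_b))   # a click on d3 credits the team that picked it
```

Over many query impressions, these per-impression preferences are aggregated (for example, as the fraction of wins for each system) to decide which ranker users prefer.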
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Radlinski, F., Hofmann, K. (2013). Practical Online Retrieval Evaluation. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_107
DOI: https://doi.org/10.1007/978-3-642-36973-5_107
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36972-8
Online ISBN: 978-3-642-36973-5