DOI: 10.1145/3331184.3331378

Effective Online Evaluation for Web Search

Published: 18 July 2019

Abstract

We present a program that offers a balanced mix of an overview of academic achievements in the field of online evaluation and unique practical experience shared by leading researchers and engineers from global Internet companies. First, we cover the necessary background in mathematical statistics. This is followed by the foundations of the main evaluation methods: A/B testing, interleaving, and observational studies. We then share rich industrial experience in constructing experimentation pipelines and evaluation metrics, emphasizing best practices and common pitfalls. A large part of the tutorial is devoted to modern, state-of-the-art techniques (including ones based on machine learning) that make online experimentation efficient. We invite software engineers, designers, analysts, and managers of web services and software products, as well as beginners, advanced specialists, and researchers, to learn how to make web service development effectively data-driven.
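
To ground the statistical background mentioned above, the following sketch (not taken from the tutorial itself) shows the core computation behind a basic A/B test: comparing a per-user metric between control and treatment groups with Welch's two-sample t-test. The metric, group sizes, and effect size are all hypothetical.

```python
# Minimal A/B-test sketch: Welch's two-sample t-test on a per-user metric.
# All data here are simulated; in practice the arrays would hold one
# aggregated metric value (e.g. sessions per user) for each experiment user.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=3.0, size=5000)    # variant A users
treatment = rng.normal(loc=10.1, scale=3.0, size=5000)  # variant B users

# Welch's variant does not assume equal variances across the two groups.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"observed lift: {treatment.mean() - control.mean():+.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant")
```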



Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN: 9781450361729
DOI: 10.1145/3331184
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. A/B tests
  2. interleaving
  3. online evaluation
  4. online metrics

Qualifiers

  • Tutorial

Conference

SIGIR '19

Acceptance Rates

SIGIR'19 paper acceptance rate: 84 of 426 submissions (20%)
Overall acceptance rate: 792 of 3,983 submissions (20%)

