ABSTRACT
Designers of online search and recommendation services often need to develop metrics to assess system performance. This tutorial focuses on mixed methods approaches to developing user-focused evaluation metrics. Metric development starts with choosing how data is logged, or how existing logs should be interpreted, including a discussion of how qualitative insights and design decisions can restrict or enable certain kinds of logging. Any metric built from logged data embeds assumptions about how users interact with the system and how they evaluate those interactions. We will cover what these assumptions look like for some traditional system evaluation metrics, and highlight quantitative and qualitative methods that make these assumptions explicit and adapt them to better reflect genuine user behavior. We discuss the role that mixed methods teams can play at each stage of metric development, from data collection through the design of online and offline metrics to the selection of metrics for decision making. We describe case studies and examples of these methods applied in the context of evaluating personalized search and recommendation systems. Finally, we close with practical advice for applied quantitative researchers who may be in the early stages of planning collaborations with qualitative researchers on mixed methods metric development.
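As a concrete illustration of the kind of assumption the tutorial examines (a minimal sketch, not material from the tutorial itself): nDCG's logarithmic rank discount implicitly assumes users scan results top-down with slowly decaying attention, whereas a metric such as rank-biased precision makes the browsing assumption an explicit, tunable continuation probability that could be fit to observed behavior. The relevance grades below are hypothetical.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain. The 1/log2(rank + 1) discount is an
    implicit user model: attention decays logarithmically with rank."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """nDCG normalizes DCG by the ideal (relevance-sorted) ranking."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def rbp(gains, persistence=0.8):
    """Rank-biased precision makes the browsing model explicit: after
    inspecting each result, the user continues to the next one with
    probability `persistence`. Lowering `persistence` models more
    impatient users; gains are expected in [0, 1]."""
    return (1 - persistence) * sum(
        g * persistence ** i for i, g in enumerate(gains)
    )

# Hypothetical graded relevance labels (0-3) for one ranked list.
grades = [3, 2, 0, 1, 0]
gains = [g / 3 for g in grades]  # map grades into [0, 1] for RBP

print(f"nDCG@5      = {ndcg_at_k(grades, 5):.3f}")
print(f"RBP(p=0.8)  = {rbp(gains):.3f}")
print(f"RBP(p=0.5)  = {rbp(gains, persistence=0.5):.3f}")
```

The two RBP calls show how the same ranked list scores differently under different assumed browsing behavior, which is exactly the kind of modeling choice that qualitative evidence about real users can inform.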