DOI: 10.1145/3447548.3470802

Mixed Method Development of Evaluation Metrics

Published: 14 August 2021

ABSTRACT

Designers of online search and recommendation services often need to develop metrics to assess system performance. This tutorial focuses on mixed-methods approaches to developing user-focused evaluation metrics. The process starts with choosing how data is logged, or how existing logs should be interpreted, and we discuss how qualitative insights and design decisions can restrict or enable certain types of logging. Any metric built from that logged data embeds assumptions about how users interact with the system and how they evaluate those interactions. We cover what these assumptions look like for several traditional system evaluation metrics and highlight quantitative and qualitative methods for making them more explicit and more expressive of genuine user behavior. We then discuss the role that mixed-methods teams can play at each stage of metric development, from data collection, through the design of both online and offline metrics, to overseeing metric selection for decision making. We describe case studies and examples of these methods applied to evaluating personalized search and recommendation systems. Finally, we close with practical advice for applied quantitative researchers who may be in the early stages of planning collaborations with qualitative researchers on mixed-methods metric development.
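
To make the point about implicit user models concrete, the sketch below (our illustration, not material from the tutorial) contrasts a traditional offline metric, DCG, whose positional discount silently fixes a pattern of user attention, with a persistence-style expected-utility metric whose browsing assumption is an explicit, estimable parameter. The function names and the default continuation probability of 0.8 are illustrative assumptions.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k results.

    The 1 / log2(rank + 1) discount is an implicit user model: it assumes
    users scan from the top and that attention decays with rank in a fixed
    way, regardless of what they actually examined or clicked.
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def expected_utility(gains, p_continue=0.8):
    """A persistence-style alternative (in the spirit of rank-biased
    precision, without its (1 - p) normalization) that makes the browsing
    assumption explicit: after examining each item the user moves on to the
    next with probability p_continue, a parameter a mixed-methods team could
    estimate or revise from logs and qualitative studies.
    """
    utility, p_examine = 0.0, 1.0
    for gain in gains:
        utility += p_examine * gain
        p_examine *= p_continue
    return utility

# Example: the same ranked list of graded relevance judgments scored under
# the two different user models.
gains = [3, 2, 0, 1, 0]
print(dcg_at_k(gains, k=5))            # position-based discount
print(expected_utility(gains, 0.8))    # explicit continuation probability
```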


Published in

KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
August 2021, 4259 pages
ISBN: 9781450383325
DOI: 10.1145/3447548

Copyright © 2021 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 14 August 2021

Qualifiers

abstract

Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
