ABSTRACT
Designers of online search and recommendation services often need to develop metrics to assess system performance. This tutorial focuses on mixed methods approaches to developing user-focused evaluation metrics. Metric development starts with choosing how data is logged, or how existing logs should be interpreted, including a discussion of how qualitative insights and design decisions can restrict or enable certain kinds of logging. Any metric built from logged data embeds assumptions about how users interact with the system and how they evaluate those interactions. We will cover what these assumptions look like for some traditional system evaluation metrics, and highlight quantitative and qualitative methods that make these assumptions explicit and adapt them to better reflect genuine user behavior. We discuss the role that mixed methods teams can play at each stage of metric development, from data collection through the design of online and offline metrics to the selection of metrics for decision making. We describe case studies and examples of these methods applied in the context of evaluating personalized search and recommendation systems. Finally, we close with practical advice for applied quantitative researchers who may be in the early stages of planning collaborations with qualitative researchers on mixed methods metric development.
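As a concrete illustration of the kind of assumption the tutorial examines (a minimal sketch, not material from the tutorial itself): nDCG's logarithmic rank discount implicitly assumes users scan results top-down with slowly decaying attention, whereas a metric such as rank-biased precision makes the browsing assumption an explicit, tunable continuation probability that could be fit to observed behavior. The relevance grades below are hypothetical.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain. The 1/log2(rank + 1) discount is an
    implicit user model: attention decays logarithmically with rank."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """nDCG normalizes DCG by the ideal (relevance-sorted) ranking."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def rbp(gains, persistence=0.8):
    """Rank-biased precision makes the browsing model explicit: after
    inspecting each result, the user continues to the next one with
    probability `persistence`. Lowering `persistence` models more
    impatient users; gains are expected in [0, 1]."""
    return (1 - persistence) * sum(
        g * persistence ** i for i, g in enumerate(gains)
    )

# Hypothetical graded relevance labels (0-3) for one ranked list.
grades = [3, 2, 0, 1, 0]
gains = [g / 3 for g in grades]  # map grades into [0, 1] for RBP

print(f"nDCG@5      = {ndcg_at_k(grades, 5):.3f}")
print(f"RBP(p=0.8)  = {rbp(gains):.3f}")
print(f"RBP(p=0.5)  = {rbp(gains, persistence=0.5):.3f}")
```

The two RBP calls show how the same ranked list scores differently under different assumed browsing behavior, which is exactly the kind of modeling choice that qualitative evidence about real users can inform.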