ABSTRACT
Instant search has emerged as the dominant search paradigm in entity-focused search applications, including search on Apple Music, Kayak, LinkedIn, and Spotify. Unlike the traditional search paradigm, in which users fully issue their query before the system performs a retrieval round, instant search delivers a new result page with every keystroke. Despite the increasing prevalence of instant search, its evaluation methodologies have not been fully developed and validated. As a result, we have no established evaluation metrics to measure improvements to instant search, and instant search systems still share offline evaluation metrics with traditional search systems. In this work, we first highlight critical differences between traditional search and instant search from an evaluation perspective. We then consider the difficulties of employing offline evaluation metrics designed for the traditional search paradigm to assess the effectiveness of instant search. Finally, we propose a new offline evaluation metric based on the unique characteristics of instant search. To demonstrate the utility of our metric, we conduct experiments on two very different platforms employing instant search: a commercial audio streaming platform and Wikipedia.
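To make the contrast with traditional search concrete, the per-keystroke interaction loop can be sketched as follows. This is a minimal illustration only, not the paper's system: the toy entity catalog and the names `CATALOG`, `result_page`, and `instant_search` are hypothetical, and the alphabetical "ranking" is a placeholder for a real relevance model.

```python
# Minimal sketch of the instant-search interaction model: one ranked
# result page is produced for every keystroke of the (partial) query,
# rather than a single retrieval round for the completed query.
# The catalog and ranking below are illustrative stand-ins.

CATALOG = ["Adele", "Adele Live", "ABBA", "AC/DC", "Arcade Fire"]

def result_page(prefix: str, k: int = 3) -> list[str]:
    """Return a top-k page of catalog entities matching the typed prefix."""
    matches = [e for e in CATALOG if e.lower().startswith(prefix.lower())]
    return sorted(matches)[:k]  # placeholder ranking: alphabetical order

def instant_search(query: str) -> list[list[str]]:
    """Simulate typing `query` one character at a time.

    Returns one result page per keystroke; a traditional system would
    instead return only the final page, result_page(query).
    """
    return [result_page(query[: i + 1]) for i in range(len(query))]

pages = instant_search("ade")
# Three keystrokes ("a", "ad", "ade") yield three result pages; the page
# narrows as the prefix grows, e.g. the final page contains only the
# entities starting with "ade".
```

An offline metric for this paradigm must therefore score the whole sequence of pages a user scans while typing, not just the final page — which is exactly the gap the abstract identifies.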
A is for Adele: An Offline Evaluation Metric for Instant Search