ABSTRACT
Instant search has emerged as the dominant search paradigm in entity-focused search applications, including search on Apple Music, Kayak, LinkedIn, and Spotify. Unlike the traditional search paradigm, in which users fully issue their query before the system performs a retrieval round, instant search delivers a new result page with every keystroke. Despite the increasing prevalence of instant search, its evaluation methodologies have not been fully developed and validated. As a result, we have no established evaluation metrics to measure improvements to instant search, and instant search systems still share offline evaluation metrics with traditional search systems. In this work, we first highlight critical differences between traditional search and instant search from an evaluation perspective. We then consider the difficulties of employing offline evaluation metrics designed for the traditional search paradigm to assess the effectiveness of instant search. Finally, we propose a new offline evaluation metric based on the unique characteristics of instant search. To demonstrate the utility of our metric, we conduct experiments on two very different platforms employing instant search: a commercial audio streaming platform and Wikipedia.
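To make the contrast with traditional search concrete, the per-keystroke interaction loop can be sketched as follows. This is a minimal illustration only, not the paper's system: the toy entity catalog and the names `CATALOG`, `result_page`, and `instant_search` are hypothetical, and the alphabetical "ranking" is a placeholder for a real relevance model.

```python
# Minimal sketch of the instant-search interaction model: one ranked
# result page is produced for every keystroke of the (partial) query,
# rather than a single retrieval round for the completed query.
# The catalog and ranking below are illustrative stand-ins.

CATALOG = ["Adele", "Adele Live", "ABBA", "AC/DC", "Arcade Fire"]

def result_page(prefix: str, k: int = 3) -> list[str]:
    """Return a top-k page of catalog entities matching the typed prefix."""
    matches = [e for e in CATALOG if e.lower().startswith(prefix.lower())]
    return sorted(matches)[:k]  # placeholder ranking: alphabetical order

def instant_search(query: str) -> list[list[str]]:
    """Simulate typing `query` one character at a time.

    Returns one result page per keystroke; a traditional system would
    instead return only the final page, result_page(query).
    """
    return [result_page(query[: i + 1]) for i in range(len(query))]

pages = instant_search("ade")
# Three keystrokes ("a", "ad", "ade") yield three result pages; the page
# narrows as the prefix grows, e.g. the final page contains only the
# entities starting with "ade".
```

An offline metric for this paradigm must therefore score the whole sequence of pages a user scans while typing, not just the final page — which is exactly the gap the abstract identifies.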
A is for Adele: An Offline Evaluation Metric for Instant Search