DOI: 10.1145/3578337.3605115
Research article

A is for Adele: An Offline Evaluation Metric for Instant Search

Published: 9 August 2023

ABSTRACT

Instant search has emerged as the dominant search paradigm in entity-focused search applications, including search on Apple Music, Kayak, LinkedIn, and Spotify. Unlike the traditional search paradigm, in which users fully issue their query before the system performs a retrieval round, instant search delivers a new result page with every keystroke. Despite its increasing prevalence, evaluation methodologies for instant search have not been fully developed and validated. As a result, there are no established evaluation metrics for measuring improvements to instant search, and instant search systems still share offline evaluation metrics with traditional search systems. In this work, we first highlight critical differences between traditional search and instant search from an evaluation perspective. We then consider the difficulties of employing offline evaluation metrics designed for the traditional search paradigm to assess the effectiveness of instant search. Finally, we propose a new offline evaluation metric based on the unique characteristics of instant search. To demonstrate the utility of our metric, we conduct experiments across two very different platforms employing instant search: a commercial audio streaming platform and Wikipedia.
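The abstract describes the instant search paradigm but not the proposed metric itself, so the following is only an illustrative sketch of the general shape of per-keystroke offline evaluation, not the paper's metric: a conventional per-page measure (here, reciprocal rank) is computed on the result page returned for each prefix of the query, then aggregated across prefixes. All names in the sketch (evaluate_instant_search, rank_fn, toy_ranker) are hypothetical.

```python
# Illustrative sketch only: the paper's metric is not specified in the
# abstract. This shows the key structural difference from traditional
# offline evaluation: every prefix of the query produces a result page
# that the user actually sees, so every prefix contributes to the score.

from typing import Callable, List, Set


def reciprocal_rank(results: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant item, 0.0 if none retrieved."""
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_instant_search(
    query: str,
    relevant: Set[str],
    rank_fn: Callable[[str], List[str]],
) -> float:
    """Average a per-page measure over the page shown at each keystroke.

    Traditional evaluation scores a single result page for the full
    query; here every prefix ("a", "ad", "ade", ...) is scored.
    """
    prefixes = [query[: i + 1] for i in range(len(query))]
    scores = [reciprocal_rank(rank_fn(p), relevant) for p in prefixes]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy prefix matcher standing in for a real instant-search backend.
    catalog = ["Adele", "Adele 21", "ABBA", "AC/DC", "Aerosmith"]

    def toy_ranker(prefix: str) -> List[str]:
        return [e for e in catalog if e.lower().startswith(prefix.lower())]

    print(evaluate_instant_search("adele", {"Adele"}, toy_ranker))
```

A realistic instant search metric would presumably weight prefixes by a user model (for example, the probability that the user keeps typing rather than stopping at a satisfactory page) instead of averaging uniformly as this sketch does.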



Published in
ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval
August 2023, 300 pages
ISBN: 9798400700736
DOI: 10.1145/3578337

            Copyright © 2023 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States




Acceptance rates
ICTIR '23: 30 of 73 submissions accepted (41%). Overall: 209 of 482 submissions accepted (43%).
