ABSTRACT
Session search evaluation has recently received more attention, because a realistic search scenario usually involves multiple queries and interactions between users and systems. Having evolved from model-based evaluation metrics for a single query, existing session-based metrics also follow a generic framework based on the cascade hypothesis. The cascade hypothesis assumes that lower-ranked search results and later-issued queries receive less attention from users and should therefore be assigned smaller weights when calculating evaluation metrics. This hypothesis has been very successful in modeling search users' behavior and in designing evaluation metrics, because it explains why users' attention decays on search engine result pages. However, recent studies have found that the recency effect also plays an important role in determining user satisfaction in search sessions. In particular, whether a user feels satisfied with the later-issued queries heavily influences their satisfaction with the whole session. To incorporate both the cascade hypothesis and the recency effect into the design of session search evaluation metrics, we propose Recency-aware Session-based Metrics (RSMs), which simultaneously characterize users' examination process with a browsing model and their cognitive process with a utility accumulation model. Using both self-constructed and publicly available user search behavior datasets, we show the effectiveness of the proposed RSMs by comparing them with existing session-based metrics in terms of correlation with user satisfaction. We also find that the influence of the cascade and recency effects varies dramatically among tasks of different difficulty and complexity, which suggests that different model parameters should be used for different types of search tasks. Our findings highlight the importance of investigating and utilizing cognitive effects, in addition to examination hypotheses, in search evaluation.
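The contrast between the two weighting schemes can be illustrated with a minimal sketch. It is not the RSM formulation from the paper, which this abstract does not specify. It assumes a geometric rank decay within each query (in the style of rank-biased precision) and a hypothetical per-query weight controlled by two illustrative parameters, `cascade` and `recency`; all function and parameter names are assumptions made for this example.

```python
from typing import List


def query_score(gains: List[float], p: float = 0.8) -> float:
    """Score one result list with a geometric rank decay.

    The result at 0-based rank r gets weight (1 - p) * p**r, so
    lower-ranked results contribute less (cascade within a query).
    """
    return sum((1 - p) * (p ** r) * g for r, g in enumerate(gains))


def session_score(
    session: List[List[float]],
    p: float = 0.8,
    cascade: float = 1.0,
    recency: float = 1.0,
) -> float:
    """Weighted average of per-query scores over a session.

    Hypothetical weighting, for illustration only:
      cascade < 1 shrinks the weight of later-issued queries;
      recency < 1 shrinks the weight of earlier-issued queries.
    """
    n = len(session)
    if n == 0:
        return 0.0
    weights = [(cascade ** i) * (recency ** (n - 1 - i)) for i in range(n)]
    total = sum(weights)
    scores = [query_score(gains, p) for gains in session]
    return sum(w * s for w, s in zip(weights, scores)) / total


# Two sessions with the same results in a different query order.
good_first = [[1.0, 0.0], [0.0, 0.0]]
good_last = [[0.0, 0.0], [1.0, 0.0]]

# A cascade-only weighting prefers the session whose first query was good;
# a recency-only weighting prefers the session whose last query was good.
print(session_score(good_first, cascade=0.5), session_score(good_last, cascade=0.5))
print(session_score(good_first, recency=0.5), session_score(good_last, recency=0.5))
```

Under these assumptions, the same pair of sessions is ranked in opposite order depending on which effect dominates, which is why the choice of parameters for a given task type matters.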
Index Terms
- Cascade or Recency: Constructing Better Evaluation Metrics for Session Search