Research Article · DOI: 10.1145/3397271.3401163

Cascade or Recency: Constructing Better Evaluation Metrics for Session Search

Published: 25 July 2020

ABSTRACT

Session search evaluation has recently received growing attention, as realistic search scenarios usually involve multiple queries and interactions between users and systems. Evolved from model-based evaluation metrics for a single query, existing session-based metrics follow a generic framework built on the cascade hypothesis, which assumes that lower-ranked search results and later-issued queries receive less attention from users and should therefore be assigned smaller weights when computing evaluation metrics. This hypothesis has proven successful in modeling search users' behavior and in designing evaluation metrics, as it explains why users' attention decays on search engine result pages. However, recent studies have found that the recency effect also plays an important role in determining user satisfaction in search sessions: in particular, whether a user feels satisfied with the later-issued queries heavily influences their satisfaction with the whole session. To incorporate both the cascade hypothesis and the recency effect into the design of session search evaluation metrics, we propose Recency-aware Session-based Metrics (RSMs), which simultaneously characterize users' examination process with a browsing model and their cognitive process with a utility accumulation model. Using both self-constructed and publicly available user search behavior datasets, we show the effectiveness of the proposed RSMs by comparing their correlation with user satisfaction against that of existing session-based metrics. We also find that the influence of the cascade and recency effects varies dramatically across tasks of different difficulty and complexity, which suggests that different model parameters should be used for different types of search tasks. Our findings highlight the importance of investigating and utilizing cognitive effects, beyond examination hypotheses, in search evaluation.
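To make the interplay between the two effects concrete, here is a minimal Python sketch of one way a session-level metric could combine them: an sDCG-style cascade discount across query positions multiplied by an exponential recency weight. The per-query DCG gain, the exponential form of the recency weight, and the parameter `gamma` are illustrative assumptions of this sketch, not the paper's actual RSM formulation.

```python
import math

def dcg(gains, b=2.0):
    """Within-query cascade discount: lower-ranked results
    contribute less (standard discounted cumulated gain)."""
    return sum(g / math.log(r + 1, b) for r, g in enumerate(gains, start=1))

def recency_aware_session_score(session, b=2.0, bq=4.0, gamma=0.8):
    """Illustrative session score combining two opposing weights over
    the queries of a session (ordered by issue time):
      * cascade: an sDCG-style discount 1 / (1 + log_bq(j)) that
        downweights later-issued queries, and
      * recency: a hypothetical weight gamma**(m - j) that upweights
        the most recently issued queries.
    `gamma` trades off the two effects and is an assumption of this
    sketch, not a parameter taken from the paper."""
    m = len(session)
    score = 0.0
    for j, gains in enumerate(session, start=1):
        cascade = 1.0 / (1.0 + math.log(j, bq))  # attention decay over queries
        recency = gamma ** (m - j)               # later queries dominate satisfaction
        score += cascade * recency * dcg(gains, b)
    return score

# Example: a three-query session with graded relevance gains per result.
session = [[3, 2, 0], [1, 0, 0], [2, 3, 1]]
print(recency_aware_session_score(session))  # ~5.71 with the defaults
```

Note that the two weights pull in opposite directions: the cascade discount favors early queries while the recency weight favors late ones, with `gamma` controlling the balance. This mirrors the abstract's observation that the appropriate trade-off between the two effects varies with task difficulty and complexity.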


Published in

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020, 2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)
