ABSTRACT
Evaluation measures are, more or less explicitly, based on user models which abstract how users interact with a ranked result list and how they accumulate utility from it. However, traditional measures typically come with a hard-coded user model which can, at best, be parametrized. Moreover, they take a deterministic approach which leads to assigning a single, precise score to a system run.
In this paper, we take a different angle and, by relying on Markov chains and random walks, we propose a new family of evaluation measures which are able to accommodate different and flexible user models, allow for simulating the interaction of different users, and turn the score into a random variable which more richly describes the performance of a system. We also show how the proposed framework allows for instantiating and better explaining some state-of-the-art measures, such as AP, RBP, DCG, and ERR.
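To make the abstract's central idea concrete, the sketch below (illustrative only, not code from the paper) treats the RBP user model as a simple random walk down the ranking: the user views rank 1, then continues to the next rank with probability p, else stops. Each simulated walk yields a sample of accumulated gain, so the score becomes a random variable; scaling its mean by (1 - p) recovers the closed-form RBP of Moffat and Zobel. The relevance vector and p = 0.8 are hypothetical choices.

```python
import random

def simulate_user(rels, p, rng):
    """One random walk over the ranked list: view rank 1, then
    continue to the next rank with probability p (RBP-style model).
    Returns the relevance gain accumulated before stopping."""
    gain = 0.0
    for rel in rels:
        gain += rel
        if rng.random() >= p:  # user stops browsing here
            break
    return gain

def rbp(rels, p):
    """Closed-form Rank-Biased Precision (Moffat and Zobel, 2008)."""
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(rels))

rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # hypothetical binary relevance
p = 0.8                                # continuation probability
rng = random.Random(42)

# The score is now a distribution over simulated users, not a point value.
samples = [simulate_user(rels, p, rng) for _ in range(100_000)]
mean_gain = sum(samples) / len(samples)

# The (1 - p)-scaled mean of the walk converges to closed-form RBP.
print(round((1 - p) * mean_gain, 3), round(rbp(rels, p), 3))
```

Beyond the mean, the `samples` list exposes the full score distribution (variance, quantiles), which is the "richer description" of system performance the abstract alludes to.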
REFERENCES
- L. Azzopardi, P. Thomas, and N. Craswell. 2018. Measuring the Utility of Search Engine Result Pages. In Proc. 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), K. Collins-Thompson, Q. Mei, B. Davison, Y. Liu, and E. Yilmaz (Eds.). ACM Press, New York, USA, 605--614.
- P. Bailey, A. Moffat, F. Scholer, and P. Thomas. 2015. User Variability and IR System Evaluation. In Proc. 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015), R. Baeza-Yates, M. Lalmas, A. Moffat, and B. Ribeiro-Neto (Eds.). ACM Press, New York, USA, 625--634.
- F. Baskaya, H. Keskustalo, and K. Järvelin. 2013. Modeling Behavioral Factors in Interactive Information Retrieval. In Proc. 22nd International Conference on Information and Knowledge Management (CIKM 2013), A. Iyengar, Q. He, J. Pei, R. Rastogi, and W. Nejdl (Eds.). ACM Press, New York, USA, 2297--2302.
- C. Buckley and E. M. Voorhees. 2005. Retrieval System Evaluation. In TREC. Experiment and Evaluation in Information Retrieval, D. K. Harman and E. M. Voorhees (Eds.). MIT Press, Cambridge (MA), USA, 53--78.
- B. A. Carterette. 2011. System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation. In Proc. 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), W.-Y. Ma, J.-Y. Nie, R. Baeza-Yates, T.-S. Chua, and W. B. Croft (Eds.). ACM Press, New York, USA, 903--912.
- B. A. Carterette, E. Kanoulas, and E. Yilmaz. 2012. Incorporating Variability in User Behavior into Systems Based Evaluation. In Proc. 21st International Conference on Information and Knowledge Management (CIKM 2012), X. Chen, G. Lebanon, H. Wang, and M. J. Zaki (Eds.). ACM Press, New York, USA, 135--144.
- O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proc. 18th International Conference on Information and Knowledge Management (CIKM 2009), D. W.-L. Cheung, I.-Y. Song, W. W. Chu, X. Hu, and J. J. Lin (Eds.). ACM Press, New York, USA, 621--630.
- W. S. Cooper. 1968. Expected Search Length: A Single Measure of Retrieval Effectiveness. American Documentation, Vol. 19, 1 (January 1968), 30--41.
- S. Dungs and N. Fuhr. 2017. Advanced Hidden Markov Models for Recognizing Search Phases. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 257--260.
- M. Ferrante, N. Ferro, and E. Losiouk. 2020. How do interval scales help us with better understanding IR evaluation measures? Information Retrieval Journal, Vol. 23, 3 (June 2020), 289--317.
- M. Ferrante, N. Ferro, and M. Maistro. 2014. Injecting User Models and Time into Precision via Markov Chains. In Proc. 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), S. Geva, A. Trotman, P. Bruza, C. L. A. Clarke, and K. Järvelin (Eds.). ACM Press, New York, USA, 597--606.
- M. Ferrante, N. Ferro, and M. Maistro. 2015. Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 21--30.
- M. Ferrante, N. Ferro, and S. Pontarollo. 2017. Are IR Evaluation Measures on an Interval Scale?. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 67--74.
- M. Ferrante, N. Ferro, and S. Pontarollo. 2019. A General Theory of IR Evaluation Measures. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 31, 3 (March 2019), 409--422.
- N. Fuhr. 2017. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. SIGIR Forum, Vol. 51, 3 (December 2017), 32--41.
- J. Hadar and W. R. Russell. 1969. Rules for Ordering Uncertain Prospects. The American Economic Review, Vol. 59, 1 (1969), 25--34.
- K. Järvelin and J. Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS), Vol. 20, 4 (October 2002), 422--446.
- D. Maxwell and L. Azzopardi. 2016. Simulating Interactive Information Retrieval: SimIIR: A Framework for the Simulation of Interaction. In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), R. Perego, F. Sebastiani, J. Aslam, I. Ruthven, and J. Zobel (Eds.). ACM Press, New York, USA, 1141--1144.
- D. M. Maxwell. 2019. Modelling Search and Stopping in Interactive Information Retrieval. Ph.D. Dissertation. School of Computing Science, College of Science and Engineering, University of Glasgow, Scotland, UK.
- A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM Transactions on Information Systems (TOIS), Vol. 27, 1 (December 2008), 2:1--2:27.
- J. R. Norris. 1998. Markov Chains. Cambridge University Press, UK.
- S. Robertson. 2008. A New Interpretation of Average Precision. In Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), T.-S. Chua, M.-K. Leong, D. W. Oard, and F. Sebastiani (Eds.). ACM Press, New York, USA, 689--690.
- G. B. Rossi. 2014. Measurement and Probability. A Probabilistic Theory of Measurement with Applications. Springer-Verlag, New York, USA.
- T. Sakai and Z. Dou. 2013. Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation. In Proc. 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), G. J. F. Jones, P. Sheridan, D. Kelly, M. de Rijke, and T. Sakai (Eds.). ACM Press, New York, USA, 473--482.
- M. Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval (FnTIR), Vol. 4, 4 (2010), 247--375.
- P. Serdyukov, N. Craswell, and G. Dupret. 2012. WSCD2012: Workshop on Web Search Click Data 2012. In Proc. 5th ACM International Conference on Web Searching and Data Mining (WSDM 2012), E. Adar, J. Teevan, E. Agichtein, and Y. Maarek (Eds.). ACM Press, New York, USA, 771--772.
- M. D. Smucker and C. L. A. Clarke. 2012a. Stochastic Simulation of Time-Biased Gain. In Proc. 21st International Conference on Information and Knowledge Management (CIKM 2012), X. Chen, G. Lebanon, H. Wang, and M. J. Zaki (Eds.). ACM Press, New York, USA, 2040--2044.
- M. D. Smucker and C. L. A. Clarke. 2012b. Time-Based Calibration of Effectiveness Measures. In Proc. 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), W. Hersh, J. Callan, Y. Maarek, and M. Sanderson (Eds.). ACM Press, New York, USA, 95--104.
- P. Thomas, P. Bailey, A. Moffat, and F. Scholer. 2014. Modeling Decision Points in User Search Behavior. In Proc. 5th Symposium on Information Interaction in Context (IIiX 2014), D. Elsweiler, B. Ludwig, L. Azzopardi, and M. Wilson (Eds.). ACM Press, New York, USA, 239--242.
- D. van Dijk, M. Ferrante, N. Ferro, and E. Kanoulas. 2019. A Markovian Approach to Evaluate Session-based IR Systems. In Advances in Information Retrieval. Proc. 41st European Conference on IR Research (ECIR 2019) -- Part I, L. Azzopardi, B. Stein, N. Fuhr, P. Mayr, C. Hauff, and D. Hiemstra (Eds.). Lecture Notes in Computer Science (LNCS) 11437, Springer, Heidelberg, Germany, 621--635.
- J. von Neumann and O. Morgenstern. 1953. Theory of Games and Economic Behavior (3rd ed.). Princeton University Press, Princeton (NJ), USA.
- E. Yilmaz and J. A. Aslam. 2006. Estimating Average Precision With Incomplete and Imperfect Judgments. In Proc. 15th International Conference on Information and Knowledge Management (CIKM 2006), P. S. Yu, V. Tsotras, E. A. Fox, and C.-B. Liu (Eds.). ACM Press, New York, USA, 102--111.
- E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. 2010. Expected Browsing Utility for Web Search Evaluation. In Proc. 19th International Conference on Information and Knowledge Management (CIKM 2010), J. Huang, N. Koudas, G. J. F. Jones, X. Wu, K. Collins-Thompson, and A. An (Eds.). ACM Press, New York, USA, 1561--1565.
- F. Zhang, Y. Liu, X. Li, M. Zhang, Y. Xu, and S. Ma. 2017b. Evaluating Web Search with a Bejeweled Player Model. In Proc. 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), N. Kando, T. Sakai, H. Joho, H. Li, A. P. de Vries, and R. W. White (Eds.). ACM Press, New York, USA, 425--434.
- Y. Zhang, X. Liu, and C. Zhai. 2017a. Information Retrieval Evaluation as Search Simulation: A General Formal Framework for IR Evaluation. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 193--200.
Index Terms
- Exploiting Stopping Time to Evaluate Accumulated Relevance