ABSTRACT
Evaluation measures are, more or less explicitly, based on user models which abstract how users interact with a ranked result list and how they accumulate utility from it. However, traditional measures typically come with a hard-coded user model which can, at best, be parametrized. Moreover, they take a deterministic approach which leads to assigning a single, precise score to a system run.
In this paper, we take a different angle and, by relying on Markov chains and random walks, we propose a new family of evaluation measures which are able to accommodate different and flexible user models, allow for simulating the interaction of different users, and turn the score into a random variable which more richly describes the performance of a system. We also show how the proposed framework allows for instantiating and better explaining some state-of-the-art measures, such as AP, RBP, DCG, and ERR.
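To make the abstract's central idea concrete, the sketch below (illustrative only, not code from the paper) treats the RBP user model as a simple random walk down the ranking: the user views rank 1, then continues to the next rank with probability p, else stops. Each simulated walk yields a sample of accumulated gain, so the score becomes a random variable; scaling its mean by (1 - p) recovers the closed-form RBP of Moffat and Zobel. The relevance vector and p = 0.8 are hypothetical choices.

```python
import random

def simulate_user(rels, p, rng):
    """One random walk over the ranked list: view rank 1, then
    continue to the next rank with probability p (RBP-style model).
    Returns the relevance gain accumulated before stopping."""
    gain = 0.0
    for rel in rels:
        gain += rel
        if rng.random() >= p:  # user stops browsing here
            break
    return gain

def rbp(rels, p):
    """Closed-form Rank-Biased Precision (Moffat and Zobel, 2008)."""
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(rels))

rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # hypothetical binary relevance
p = 0.8                                # continuation probability
rng = random.Random(42)

# The score is now a distribution over simulated users, not a point value.
samples = [simulate_user(rels, p, rng) for _ in range(100_000)]
mean_gain = sum(samples) / len(samples)

# The (1 - p)-scaled mean of the walk converges to closed-form RBP.
print(round((1 - p) * mean_gain, 3), round(rbp(rels, p), 3))
```

Beyond the mean, the `samples` list exposes the full score distribution (variance, quantiles), which is the "richer description" of system performance the abstract alludes to.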
REFERENCES
- L. Azzopardi, P. Thomas, and N. Craswell. 2018. Measuring the Utility of Search Engine Result Pages. In Proc. 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), K. Collins-Thompson, Q. Mei, B. Davison, Y. Liu, and E. Yilmaz (Eds.). ACM Press, New York, USA, 605--614.
- P. Bailey, A. Moffat, F. Scholer, and P. Thomas. 2015. User Variability and IR System Evaluation. In Proc. 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015), R. Baeza-Yates, M. Lalmas, A. Moffat, and B. Ribeiro-Neto (Eds.). ACM Press, New York, USA, 625--634.
- F. Baskaya, H. Keskustalo, and K. Järvelin. 2013. Modeling Behavioral Factors in Interactive Information Retrieval. In Proc. 22nd International Conference on Information and Knowledge Management (CIKM 2013), A. Iyengar, Q. He, J. Pei, R. Rastogi, and W. Nejdl (Eds.). ACM Press, New York, USA, 2297--2302.
- C. Buckley and E. M. Voorhees. 2005. Retrieval System Evaluation. In TREC. Experiment and Evaluation in Information Retrieval, D. K. Harman and E. M. Voorhees (Eds.). MIT Press, Cambridge (MA), USA, 53--78.
- B. A. Carterette. 2011. System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation. In Proc. 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), W.-Y. Ma, J.-Y. Nie, R. Baeza-Yates, T.-S. Chua, and W. B. Croft (Eds.). ACM Press, New York, USA, 903--912.
- B. A. Carterette, E. Kanoulas, and E. Yilmaz. 2012. Incorporating Variability in User Behavior into Systems Based Evaluation. In Proc. 21st International Conference on Information and Knowledge Management (CIKM 2012), X. Chen, G. Lebanon, H. Wang, and M. J. Zaki (Eds.). ACM Press, New York, USA, 135--144.
- O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proc. 18th International Conference on Information and Knowledge Management (CIKM 2009), D. W.-L. Cheung, I.-Y. Song, W. W. Chu, X. Hu, and J. J. Lin (Eds.). ACM Press, New York, USA, 621--630.
- W. S. Cooper. 1968. Expected Search Length: A Single Measure of Retrieval Effectiveness. American Documentation, Vol. 19, 1 (January 1968), 30--41.
- S. Dungs and N. Fuhr. 2017. Advanced Hidden Markov Models for Recognizing Search Phases. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 257--260.
- M. Ferrante, N. Ferro, and E. Losiouk. 2020. How do interval scales help us with better understanding IR evaluation measures? Information Retrieval Journal, Vol. 23, 3 (June 2020), 289--317.
- M. Ferrante, N. Ferro, and M. Maistro. 2014. Injecting User Models and Time into Precision via Markov Chains. In Proc. 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), S. Geva, A. Trotman, P. Bruza, C. L. A. Clarke, and K. Järvelin (Eds.). ACM Press, New York, USA, 597--606.
- M. Ferrante, N. Ferro, and M. Maistro. 2015. Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 21--30.
- M. Ferrante, N. Ferro, and S. Pontarollo. 2017. Are IR Evaluation Measures on an Interval Scale?. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 67--74.
- M. Ferrante, N. Ferro, and S. Pontarollo. 2019. A General Theory of IR Evaluation Measures. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 31, 3 (March 2019), 409--422.
- N. Fuhr. 2017. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. SIGIR Forum, Vol. 51, 3 (December 2017), 32--41.
- J. Hadar and W. R. Russell. 1969. Rules for Ordering Uncertain Prospects. The American Economic Review, Vol. 59, 1 (1969), 25--34.
- K. Järvelin and J. Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS), Vol. 20, 4 (October 2002), 422--446.
- D. Maxwell and L. Azzopardi. 2016. Simulating Interactive Information Retrieval: SimIIR: A Framework for the Simulation of Interaction. In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), R. Perego, F. Sebastiani, J. Aslam, I. Ruthven, and J. Zobel (Eds.). ACM Press, New York, USA, 1141--1144.
- D. M. Maxwell. 2019. Modelling Search and Stopping in Interactive Information Retrieval. Ph.D. Dissertation. School of Computing Science, College of Science and Engineering, University of Glasgow, Scotland, UK.
- A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM Transactions on Information Systems (TOIS), Vol. 27, 1 (December 2008), 2:1--2:27.
- J. R. Norris. 1998. Markov Chains. Cambridge University Press, UK.
- S. Robertson. 2008. A New Interpretation of Average Precision. In Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), T.-S. Chua, M.-K. Leong, D. W. Oard, and F. Sebastiani (Eds.). ACM Press, New York, USA, 689--690.
- G. B. Rossi. 2014. Measurement and Probability. A Probabilistic Theory of Measurement with Applications. Springer-Verlag, New York, USA.
- T. Sakai and Z. Dou. 2013. Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation. In Proc. 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), G. J. F. Jones, P. Sheridan, D. Kelly, M. de Rijke, and T. Sakai (Eds.). ACM Press, New York, USA, 473--482.
- M. Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval (FnTIR), Vol. 4, 4 (2010), 247--375.
- P. Serdyukov, N. Craswell, and G. Dupret. 2012. WSCD2012: Workshop on Web Search Click Data 2012. In Proc. 5th ACM International Conference on Web Searching and Data Mining (WSDM 2012), E. Adar, J. Teevan, E. Agichtein, and Y. Maarek (Eds.). ACM Press, New York, USA, 771--772.
- M. D. Smucker and C. L. A. Clarke. 2012a. Stochastic Simulation of Time-Biased Gain. In Proc. 21st International Conference on Information and Knowledge Management (CIKM 2012), X. Chen, G. Lebanon, H. Wang, and M. J. Zaki (Eds.). ACM Press, New York, USA, 2040--2044.
- M. D. Smucker and C. L. A. Clarke. 2012b. Time-Based Calibration of Effectiveness Measures. In Proc. 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), W. Hersh, J. Callan, Y. Maarek, and M. Sanderson (Eds.). ACM Press, New York, USA, 95--104.
- P. Thomas, P. Bailey, A. Moffat, and F. Scholer. 2014. Modeling Decision Points in User Search Behavior. In Proc. 5th Symposium on Information Interaction in Context (IIiX 2014), D. Elsweiler, B. Ludwig, L. Azzopardi, and M. Wilson (Eds.). ACM Press, New York, USA, 239--242.
- D. van Dijk, M. Ferrante, N. Ferro, and E. Kanoulas. 2019. A Markovian Approach to Evaluate Session-based IR Systems. In Advances in Information Retrieval. Proc. 41st European Conference on IR Research (ECIR 2019) -- Part I, L. Azzopardi, B. Stein, N. Fuhr, P. Mayr, C. Hauff, and D. Hiemstra (Eds.). Lecture Notes in Computer Science (LNCS) 11437, Springer, Heidelberg, Germany, 621--635.
- J. von Neumann and O. Morgenstern. 1953. Theory of Games and Economic Behavior (3rd ed.). Princeton University Press, Princeton (NJ), USA.
- E. Yilmaz and J. A. Aslam. 2006. Estimating Average Precision With Incomplete and Imperfect Judgments. In Proc. 15th International Conference on Information and Knowledge Management (CIKM 2006), P. S. Yu, V. Tsotras, E. A. Fox, and C.-B. Liu (Eds.). ACM Press, New York, USA, 102--111.
- E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. 2010. Expected Browsing Utility for Web Search Evaluation. In Proc. 19th International Conference on Information and Knowledge Management (CIKM 2010), J. Huang, N. Koudas, G. J. F. Jones, X. Wu, K. Collins-Thompson, and A. An (Eds.). ACM Press, New York, USA, 1561--1565.
- F. Zhang, Y. Liu, X. Li, M. Zhang, Y. Xu, and S. Ma. 2017b. Evaluating Web Search with a Bejeweled Player Model. In Proc. 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), N. Kando, T. Sakai, H. Joho, H. Li, A. P. de Vries, and R. W. White (Eds.). ACM Press, New York, USA, 425--434.
- Y. Zhang, X. Liu, and C. Zhai. 2017a. Information Retrieval Evaluation as Search Simulation: A General Formal Framework for IR Evaluation. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 193--200.
Index Terms
- Exploiting Stopping Time to Evaluate Accumulated Relevance