ABSTRACT
Evaluation metrics play a critical role both in the comparative evaluation of retrieval system performance and in learning-to-rank (LTR), where they serve as objective functions to be optimized. Many different evaluation metrics have been proposed in the IR literature, with average precision (AP) being the dominant one due to a number of desirable properties it possesses. However, most of these measures, including average precision, do not incorporate graded relevance.
In this work, we propose a new measure of retrieval effectiveness, the Graded Average Precision (GAP). GAP generalizes average precision to the case of multi-graded relevance and inherits all the desirable characteristics of AP: it has a nice probabilistic interpretation, it approximates the area under a graded precision-recall curve, and it can be justified in terms of a simple but moderately plausible user model. We then evaluate GAP in terms of its informativeness and discriminative power. Finally, we show that GAP can reliably be used as an objective metric in learning to rank: optimizing for GAP using SoftRank and LambdaRank leads to better-performing ranking functions than those constructed by the same algorithms tuned to optimize for AP or NDCG, even when AP or NDCG is used as the test metric.
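To make the threshold-based user model concrete: suppose relevance grades run from 1 to c (with grade 0 for nonrelevant), and g_i is the probability that a user draws their relevance threshold at grade i, so a grade-j document satisfies a random user with probability P(j) = g_1 + ... + g_j. The sketch below computes GAP from this model using a pairwise min-grade formulation that reduces to ordinary AP in the binary case. This is a minimal illustrative reading, not code from the paper: the exact formulation, the function name `graded_average_precision`, and all example grade and probability values are assumptions.

```python
def graded_average_precision(ranked_grades, g, judged_grades=None):
    """Sketch of GAP for a single ranked list (see caveats above).

    ranked_grades -- relevance grade of the document at each rank, top
                     first; grades are assumed to run from 0 to c.
    g             -- dict {grade: probability} where g[i] is the chance
                     that a user's relevance threshold equals grade i;
                     the probabilities over grades 1..c should sum to 1.
    judged_grades -- grades of every judged document for the topic, so
                     relevant documents missing from the ranking still
                     count in the normalizer; defaults to ranked_grades.
    """
    if judged_grades is None:
        judged_grades = ranked_grades

    c = max(g)
    # P[j]: probability that a grade-j document clears a random user's
    # threshold, i.e. g_1 + ... + g_j (with P[0] = 0).
    P = [0.0] * (c + 1)
    for j in range(1, c + 1):
        P[j] = P[j - 1] + g.get(j, 0.0)

    # Numerator: at each rank n holding a relevant document, sum over
    # ranks m <= n the probability that BOTH documents clear the same
    # sampled threshold, which is P[min(grade_m, grade_n)], then divide
    # by n (a graded analogue of precision at rank n).
    numerator = 0.0
    for n, grade_n in enumerate(ranked_grades, start=1):
        if grade_n == 0:
            continue  # min(grade_m, 0) = 0 contributes nothing
        pair_mass = sum(P[min(grade_m, grade_n)]
                        for grade_m in ranked_grades[:n])
        numerator += pair_mass / n

    # Denominator: expected number of documents the sampled user deems
    # relevant, over all judged documents (the graded analogue of R).
    denominator = sum(P[j] for j in judged_grades)
    return numerator / denominator if denominator > 0 else 0.0
```

A useful sanity check on this formulation is the binary special case, where GAP should collapse to standard AP:

```python
# Binary case: one grade, threshold always at grade 1, so GAP = AP.
ap = graded_average_precision([1, 0, 1, 0], {1: 1.0})
assert abs(ap - (1.0 + 2.0 / 3.0) / 2.0) < 1e-9  # (P@1 + P@3) / R

# Illustrative three-grade run with made-up threshold probabilities.
score = graded_average_precision([2, 0, 1, 3, 0], {1: 0.5, 2: 0.3, 3: 0.2})
print(round(score, 4))
```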