DOI: 10.1145/3121050.3121064

Dealing with Incomplete Judgments in Cascade Measures

Published: 01 October 2017

Abstract

Cascade measures like alpha-nDCG, ERR-IA, and NRBP take into account the novelty and diversity of query results and are computed from human relevance judgments, which are costly to collect. These measures expect that all documents in a query's result list are judged, and they cannot make use of judgments beyond the assigned labels. Existing work has demonstrated that condensing the query results by removing unjudged documents can address this problem to some extent. However, how highly incomplete judgments affect cascade measures, and how to cope with such incompleteness, has not been addressed yet. In this paper, we propose an approach that mitigates incomplete judgments by leveraging language models estimated from the content of documents relevant to the query's subtopics. These language models are estimated at each rank, taking into account the document at that rank and the documents ranked above it. Our method then determines gain values based on the Kullback-Leibler divergence between the language models. Experiments on the diversity tasks of the TREC Web Track 2009--2012 show that, with only 15% of the judgments, our method accurately reconstructs the original rankings determined by the established cascade measures.
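As an illustration of the kind of computation described above, the sketch below builds smoothed unigram language models and compares them with the Kullback-Leibler divergence. This is a minimal sketch under stated assumptions: the abstract does not specify how the language models are estimated or how divergence is mapped to gain values, so the smoothing scheme, the `estimated_gain` function, and the divergence-to-gain mapping are illustrative placeholders rather than the authors' exact method.

```python
import math
from collections import Counter


def unigram_lm(docs, vocab, mu=1.0):
    """Smoothed unigram language model over a list of tokenized documents.

    Additive smoothing (an assumption, not necessarily the paper's choice)
    keeps every vocabulary term at non-zero probability so that the
    KL divergence below stays finite.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def estimated_gain(ranked_docs, rank, subtopic_relevant_docs, vocab):
    """Illustrative gain estimate for the document at `rank` (0-based).

    Compares a language model built from documents known to be relevant to
    a subtopic against a model of the documents retrieved up to and
    including this rank. The divergence-to-gain mapping is a placeholder.
    """
    relevant_lm = unigram_lm(subtopic_relevant_docs, vocab)
    seen_lm = unigram_lm(ranked_docs[:rank + 1], vocab)
    divergence = kl_divergence(relevant_lm, seen_lm)
    # Smaller divergence means the ranking up to this point covers the
    # subtopic's relevant content better, so assign a higher gain.
    return 1.0 / (1.0 + divergence)


# Toy usage with hypothetical tokenized documents.
relevant = [["solar", "panel", "cost"], ["solar", "energy", "price"]]
ranking = [["solar", "panel", "installation"], ["football", "scores"]]
vocab = {t for doc in relevant + ranking for t in doc}
print(estimated_gain(ranking, 0, relevant, vocab))
print(estimated_gain(ranking, 1, relevant, vocab))
```

In the setting of the paper, such content-based gain estimates would stand in for the labels of unjudged documents when only a small fraction (here, 15%) of the judgments is available.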


Cited By

  • One-Shot Labeling for Automatic Relevance Estimation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2230--2235. Online publication date: 19 July 2023. DOI: 10.1145/3539618.3592032

Published In

ICTIR '17: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval
October 2017
348 pages
ISBN: 9781450344906
DOI: 10.1145/3121050
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ir evaluation
  2. low-cost evaluation
  3. novelty and diversity

Qualifiers

  • Research-article

Conference

ICTIR '17

Acceptance Rates

ICTIR '17 Paper Acceptance Rate: 27 of 54 submissions, 50%
Overall Acceptance Rate: 235 of 527 submissions, 45%
