DOI: 10.1145/2348283.2348403

An uncertainty-aware query selection model for evaluation of IR systems

Published: 12 August 2012

Abstract

We propose a mathematical framework for query selection as a mechanism for reducing the cost of constructing information retrieval test collections. In particular, our mathematical formulation explicitly models the uncertainty in the retrieval effectiveness metrics that is introduced by the absence of relevance judgments. Since the resulting optimization problem is computationally intractable, we devise an adaptive query selection algorithm, referred to as Adaptive, that provides an approximate solution. Adaptive selects queries iteratively and assumes that no relevance judgments are available for the query under consideration. Once a query is selected, the associated relevance assessments are acquired and then used to aid the selection of subsequent queries. We demonstrate the effectiveness of the algorithm on two TREC test collections as well as a 1000-query test collection from an online search engine. Our experimental results show that the queries chosen by Adaptive produce a reliable performance ranking of systems, and that this ranking is more strongly correlated with the actual system ranking than the rankings produced by queries selected using the baseline methods we consider.
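
To make the iterative selection procedure concrete, the sketch below gives one plausible reading of the Adaptive loop, in Python. The scoring function expected_ranking_value and the judgment-collection step acquire_judgments are hypothetical stand-ins; the paper's actual formulation derives the candidate score from an uncertainty-aware model of the effectiveness metrics, which is not reproduced here.

    import random

    def expected_ranking_value(candidate, selected, judgments):
        # Stand-in utility: estimate how much adding `candidate` to the already
        # judged queries would improve the reliability of the system ranking.
        # The candidate is scored as if it has no judgments of its own yet.
        return random.random()

    def acquire_judgments(query):
        # Stand-in for collecting relevance assessments for `query`
        # (e.g. from human assessors) once it has been selected.
        return {query: {}}

    def adaptive_select(query_pool, budget):
        # Greedily pick `budget` queries; after each pick, acquire its judgments
        # and let them inform the scoring of the remaining candidates.
        selected, judgments = [], {}
        remaining = set(query_pool)
        for _ in range(min(budget, len(remaining))):
            best = max(remaining,
                       key=lambda q: expected_ranking_value(q, selected, judgments))
            selected.append(best)
            remaining.remove(best)
            judgments.update(acquire_judgments(best))
        return selected

The point this sketch preserves is the order of operations: a candidate query is scored under the assumption that it has no relevance judgments, and its judgments are acquired and folded into the selection state only after it has been chosen.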

    Published In

    SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
    August 2012
    1236 pages
    ISBN:9781450314725
    DOI:10.1145/2348283

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. information retrieval
    2. query selection
    3. test collection

    Qualifiers

    • Research-article

    Conference

    SIGIR '12

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 1960-1970. DOI: 10.1145/3583780.3614916. Online publication date: 21-Oct-2023.
    • (2020) ArTest: The First Test Collection for Arabic Web Search with Relevance Rationales. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2017-2020. DOI: 10.1145/3397271.3401223. Online publication date: 25-Jul-2020.
    • (2020) Fewer topics? A million topics? Both?! On topics subsets in test collections. Information Retrieval, 23(1):49-85. DOI: 10.1007/s10791-019-09357-w. Online publication date: 1-Feb-2020.
    • (2019) Constructing Test Collections using Multi-armed Bandits and Active Learning. The World Wide Web Conference, pages 3158-3164. DOI: 10.1145/3308558.3313675. Online publication date: 13-May-2019.
    • (2019) The Evolution of Cranfield. Information Retrieval Evaluation in a Changing World, pages 45-69. DOI: 10.1007/978-3-030-22948-1_2. Online publication date: 14-Aug-2019.
    • (2019) Correlation, Prediction and Ranking of Evaluation Metrics in Information Retrieval. Advances in Information Retrieval, pages 636-651. DOI: 10.1007/978-3-030-15712-8_41. Online publication date: 7-Apr-2019.
    • (2018) On Building Fair and Reusable Test Collections using Bandit Techniques. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 407-416. DOI: 10.1145/3269206.3271766. Online publication date: 17-Oct-2018.
    • (2018) A characterization of sample selection bias in system evaluation and the case of information retrieval. International Journal of Data Science and Analytics, 6(2):131-146. DOI: 10.1007/s41060-018-0134-x. Online publication date: 5-Jul-2018.
    • (2016) Direct measurement of training query quality for learning to rank. Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 1035-1040. DOI: 10.1145/2851613.2851693. Online publication date: 4-Apr-2016.
    • (2016) Impact of Query Sample Selection Bias on Information Retrieval System Ranking. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 341-350. DOI: 10.1109/DSAA.2016.43. Online publication date: Oct-2016.
