Abstract
For decades, the use of test collections has been the standard approach to information retrieval evaluation. However, owing to the way such collections are constructed, this approach has a number of limitations: bias introduced by pooling, disagreement between human assessors, varying levels of topic difficulty, and the performance constraints of the evaluation metrics. Any of these factors may distort the measured relative effectiveness of different retrieval strategies, or more precisely of the retrieval systems, and thus yield unreliable system rankings. In this study, we propose techniques for estimating the reliability of a retrieval system's effectiveness rank based on rankings from multiple experiments. These rankings may come from previous experimental results or may be generated by running multiple experiments over smaller sets of topics. The techniques help to predict more precisely how each system will perform in future experiments. To validate the proposed rank reliability estimation methods, two alternative system ranking methods are proposed to generate new system rankings. The experiments show that the system rank correlation coefficient values mostly remain above 0.8 against the gold standard. Moreover, the proposed techniques generate system rankings that are more reliable than the baseline, i.e., the traditional system ranking techniques used in Text REtrieval Conference (TREC)-like initiatives. The results from both TREC-2004 and TREC-8 show the same outcome, which further confirms the effectiveness of the proposed rank reliability estimation method.
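The abstract reports rank correlation values above 0.8 against the gold standard. As a minimal sketch of what such a comparison involves, the snippet below computes Kendall's tau between a gold-standard system ranking and a ranking from a reduced-topic experiment. This assumes Kendall's tau is the coefficient in question (the abstract does not name it, though it is the customary choice in TREC-style comparisons), and the system names and rank values are purely illustrative.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same set of systems.

    rank_a, rank_b: dicts mapping system name -> rank position (1 = best).
    Tau = (concordant pairs - discordant pairs) / total pairs.
    """
    systems = list(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant if both rankings order s and t the same way.
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(systems)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical gold-standard ranking (full topic set) vs. a ranking
# obtained from an experiment over a smaller topic subset.
gold = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4, "sysE": 5}
reduced = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4, "sysE": 5}

print(kendall_tau(gold, reduced))  # one swapped pair out of 10 -> 0.8
```

With five systems there are ten pairs; a single swapped pair (sysB/sysC) gives tau = (9 - 1)/10 = 0.8, i.e., exactly the threshold the abstract cites as indicating a reliable ranking.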
Acknowledgements
This research was supported by UMRG RP028E-14AET and the Exploratory Research Grant Scheme (ERGS) ER027-2013A.
Cite this article
Zhang, S., Ravana, S.D. Estimating reliability of the retrieval systems effectiveness rank based on performance in multiple experiments. Cluster Comput 20, 925–940 (2017). https://doi.org/10.1007/s10586-016-0709-z