Estimating reliability of the retrieval systems effectiveness rank based on performance in multiple experiments

Abstract

For decades, the use of test collections has been a standard approach in information retrieval evaluation. However, given the intrinsic nature of their construction, this approach has a number of limitations, such as bias in pooling, disagreement between human assessors, varying levels of topic difficulty, and performance constraints of the evaluation metrics. Any of these factors may distort the measured relative effectiveness of different retrieval strategies, or rather of the retrieval systems, and thus lead to unreliable system rankings. In this study, we propose techniques for estimating the reliability of a retrieval system's effectiveness rank based on rankings from multiple experiments. These rankings may come from previous experimental results or be generated by conducting multiple experiments over smaller sets of topics. The techniques help predict each system's performance in future experiments more precisely. To validate the proposed rank reliability estimation methods, two alternative system ranking methods are proposed to generate new system rankings. The experiments show that the system rank correlation coefficient values mostly remain above 0.8 against the gold standard. Moreover, the proposed techniques generate system rankings that are more reliable than the baseline [the traditional system ranking techniques used in Text REtrieval Conference (TREC)-like initiatives]. The results from both TREC-2004 and TREC-8 show the same outcome, which further confirms the effectiveness of the proposed rank reliability estimation methods.
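
To make the setting concrete, the sketch below is a minimal illustration, not the paper's method: it shows the general workflow the abstract refers to, namely ranking systems by an effectiveness score in each of several experiments, summarising how stable each system's rank is across those experiments, and correlating a ranking derived from them against a gold-standard ranking using Kendall's tau. All function names, the stability measure (mean rank and its spread), and the example scores are hypothetical placeholders.

# Minimal sketch (Python). Assumptions: per-experiment scores are given,
# rank stability is summarised as mean rank plus spread, and Kendall's
# tau-a (no ties) is used for correlation. Not the paper's procedure.
from statistics import mean, pstdev

def rank_systems(scores):
    """Rank systems by effectiveness score (rank 1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: position + 1 for position, system in enumerate(ordered)}

def rank_stability(rankings, system):
    """Mean rank of one system across experiments and its spread;
    a smaller spread suggests a more reliable rank."""
    ranks = [ranking[system] for ranking in rankings]
    return mean(ranks), pstdev(ranks)

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings given as {system: rank} dicts
    (no tied ranks assumed)."""
    systems = list(rank_a)
    pairs = len(systems) * (len(systems) - 1) // 2
    concordant = discordant = 0
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            a = rank_a[systems[i]] - rank_a[systems[j]]
            b = rank_b[systems[i]] - rank_b[systems[j]]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    return (concordant - discordant) / pairs

# Hypothetical per-experiment effectiveness scores (e.g., MAP over a topic subset).
experiments = [
    {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22},
    {"sysA": 0.29, "sysB": 0.30, "sysC": 0.21},
    {"sysA": 0.33, "sysB": 0.27, "sysC": 0.24},
]
rankings = [rank_systems(scores) for scores in experiments]

for system in ("sysA", "sysB", "sysC"):
    avg_rank, spread = rank_stability(rankings, system)
    print(f"{system}: mean rank {avg_rank:.2f}, spread {spread:.2f}")

# Derive one ranking from the repeated experiments (here simply by mean rank)
# and correlate it with a gold-standard ranking built from the full topic set.
gold = {"sysA": 1, "sysB": 2, "sysC": 3}
derived = rank_systems({s: -mean(r[s] for r in rankings) for s in gold})
print(f"Kendall's tau against the gold standard: {kendall_tau(gold, derived):.2f}")

In a real evaluation, the per-experiment scores would come from runs over topic subsets or from previous TREC-style experiments, and the gold-standard ranking from the full topic set with complete relevance judgments.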

Notes

  1. http://trec.nist.gov/.

Acknowledgements

This research was supported by UMRG RP028E-14AET and the Exploratory Research Grant Scheme (ERGS) ER027-2013A.

Author information

Corresponding author

Correspondence to Shuxiang Zhang.

Cite this article

Zhang, S., Ravana, S.D. Estimating reliability of the retrieval systems effectiveness rank based on performance in multiple experiments. Cluster Comput 20, 925–940 (2017). https://doi.org/10.1007/s10586-016-0709-z
