Abstract
For decades, the use of test collections has been the standard approach to information retrieval evaluation. However, owing to the way such collections are constructed, this approach has a number of limitations: bias introduced by pooling, disagreement between human assessors, varying levels of topic difficulty, and the performance constraints of the evaluation metrics. Any of these factors may distort the measured relative effectiveness of different retrieval strategies, or more precisely of the retrieval systems, and thus yield unreliable system rankings. In this study, we propose techniques for estimating the reliability of a retrieval system's effectiveness rank based on rankings from multiple experiments. These rankings may come from previous experimental results or may be generated by running multiple experiments over smaller sets of topics. The techniques help to predict more precisely how each system will perform in future experiments. To validate the proposed rank reliability estimation methods, two alternative system ranking methods are proposed to generate new system rankings. The experiments show that the system rank correlation coefficient values mostly remain above 0.8 against the gold standard. Moreover, the proposed techniques generate system rankings that are more reliable than the baseline, i.e., the traditional system ranking techniques used in Text REtrieval Conference (TREC)-like initiatives. The results from both TREC-2004 and TREC-8 show the same outcome, which further confirms the effectiveness of the proposed rank reliability estimation method.
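The abstract reports rank correlation values above 0.8 against the gold standard. As a minimal sketch of what such a comparison involves, the snippet below computes Kendall's tau between a gold-standard system ranking and a ranking from a reduced-topic experiment. This assumes Kendall's tau is the coefficient in question (the abstract does not name it, though it is the customary choice in TREC-style comparisons), and the system names and rank values are purely illustrative.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same set of systems.

    rank_a, rank_b: dicts mapping system name -> rank position (1 = best).
    Tau = (concordant pairs - discordant pairs) / total pairs.
    """
    systems = list(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant if both rankings order s and t the same way.
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(systems)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical gold-standard ranking (full topic set) vs. a ranking
# obtained from an experiment over a smaller topic subset.
gold = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4, "sysE": 5}
reduced = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4, "sysE": 5}

print(kendall_tau(gold, reduced))  # one swapped pair out of 10 -> 0.8
```

With five systems there are ten pairs; a single swapped pair (sysB/sysC) gives tau = (9 - 1)/10 = 0.8, i.e., exactly the threshold the abstract cites as indicating a reliable ranking.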
Acknowledgements
This research was supported by UMRG RP028E-14AET and the Exploratory Research Grant Scheme (ERGS) ER027-2013A.
Cite this article
Zhang, S., Ravana, S.D. Estimating reliability of the retrieval systems effectiveness rank based on performance in multiple experiments. Cluster Comput 20, 925–940 (2017). https://doi.org/10.1007/s10586-016-0709-z