Abstract
Relevance judgments are often the most expensive part of information retrieval evaluation, and techniques for comparing retrieval systems using fewer relevance judgments have received significant attention in recent years. This paper proposes a novel system comparison method based on an expectation-maximization algorithm. In the expectation step, real-valued pseudo-judgments are estimated from a set of system results. In the maximization step, new system weights are learned from a combination of a limited number of actual human judgments and system pseudo-judgments for the remaining documents. The method can operate without any human judgments, and its accuracy improves as human judgments are incrementally added. Experiments on TREC Ad Hoc collections demonstrate strong correlation with system rankings produced from pooled human judgments, and comparison with existing baselines indicates that the new method achieves the same comparison reliability with fewer human judgments.
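The EM loop described in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the authors' actual algorithm: the function name, the use of a weighted average of system scores as the E-step pseudo-judgment, and the agreement-based M-step weight update are all assumptions made for exposition.

```python
import numpy as np

def em_system_comparison(scores, human, n_iter=20):
    """Illustrative EM sketch for judgment-light system comparison.

    scores: array of shape (n_systems, n_docs), each system's relevance
            score for each document, scaled to [0, 1].
    human:  dict {doc_index: 0 or 1} of available human judgments.
    Returns (system weights, final pseudo-judgments).
    """
    n_sys, n_docs = scores.shape
    w = np.ones(n_sys) / n_sys  # start with uniform system weights
    pseudo = scores.mean(axis=0)
    for _ in range(n_iter):
        # E-step: pseudo-judgments as a weighted combination of system scores
        pseudo = w @ scores
        # Human judgments, where available, override pseudo-judgments
        for d, rel in human.items():
            pseudo[d] = rel
        # M-step: re-weight each system by its agreement with the
        # (human + pseudo) judgments, then renormalize
        agreement = 1.0 - np.abs(scores - pseudo).mean(axis=1)
        w = agreement / agreement.sum()
    return w, pseudo
```

Under this sketch, a system that agrees with the human-judged documents gains weight, which in turn pulls the pseudo-judgments for unjudged documents toward that system's scores on the next iteration.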
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Gao, N., Webber, W., Oard, D.W. (2014). Reducing Reliance on Relevance Judgments for System Comparison by Using Expectation-Maximization. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6