Abstract
This paper investigates the effect of performance measures and relevance functions on comparing retrieval systems in INEX, an evaluation forum dedicated to XML retrieval. We focus on two interdependent challenges that arise when evaluating XML retrieval systems: the weak ordering of retrieved lists (ties among returned results) and multivalued relevance scales. Our analysis provides empirical evidence on the reasonableness of two popular assumptions in information retrieval (IR) evaluation, namely that ties can be ignored and that binary relevance is sufficient. We also shed light on the impact of a parameter of Q-measure [18] on the sensitivity of that metric.
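For readers unfamiliar with Q-measure, the following is a minimal sketch of the metric as introduced by Sakai [18], under the standard formulation: at each rank holding a relevant document, a blended ratio of cumulative gain and precision is computed, controlled by a blending parameter beta (the parameter whose effect on sensitivity the paper examines). The function names and the binary-gain example below are illustrative, not taken from the paper.

```python
def q_measure(gains, ideal_gains, R, beta=1.0):
    """Sketch of Q-measure [18] with blending parameter beta.

    gains:       gain value at each rank of the retrieved list (0 = nonrelevant)
    ideal_gains: gains of the ideal ranked list, sorted in descending order
    R:           total number of relevant documents for the topic
    """
    cg = 0.0      # cumulative gain of the system run
    cig = 0.0     # cumulative gain of the ideal run
    count = 0     # relevant documents seen so far
    total = 0.0
    for r, g in enumerate(gains, start=1):
        cg += g
        # the ideal list may be shorter than the run; cig stays flat afterwards
        if r <= len(ideal_gains):
            cig += ideal_gains[r - 1]
        if g > 0:  # rank r holds a relevant document: add the blended ratio
            count += 1
            total += (beta * cg + count) / (beta * cig + r)
    return total / R
```

With binary gains and beta = 0, the blended ratio reduces to precision at rank r, so Q-measure collapses to average precision; increasing beta gives more weight to the graded (multivalued) relevance levels, which is why the choice of this parameter matters for the ranking experiments discussed above.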
References
Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: ACM SIGIR 2000, Athens, Greece, pp. 33–40. ACM Press, New York (2000)
Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Sanderson, et al. (eds.) [19], pp. 25–32
Cooper, W.S.: Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. American Documentation 19(1), 30–41 (1968)
Davison, A.C., Hinkley, D.V.: Bootstrap Methods and Their Application. Cambridge University Press, Cambridge (1997)
de Vries, A.P., Kazai, G., Lalmas, M.: Evaluation metrics 2004. In: INEX 2004 Workshop Pre-Proceedings, pp. 249–250 (2004). Available at http://inex.is.informatik.uni-duisburg.de:2004/pdf/INEX2004PreProceedings.pdf
de Vries, A.P., Kazai, G., Lalmas, M.: Tolerance to Irrelevance: A User-effort Oriented Evaluation of Retrieval Systems without Predefined Retrieval Unit. In: RIAO 2004, Avignon, France, pp. 463–473 (April 2004)
Hawking, D., Robertson, S.: On collection size and retrieval effectiveness. Information Retrieval 6(1), 99–105 (2003)
Hull, D.A., Kantor, P., Ng, K.: Advanced approaches to the statistical analysis of TREC information retrieval experiments. Technical report (1997), Unpublished, contact the first author for a copy: hull@clairvoyancecorp.com
Kazai, G., Lalmas, M., de Vries, A.P.: The overlap problem in content-oriented XML retrieval evaluation. In: Sanderson, et al. (eds.) [19], pp. 72–79
Kazai, G., Lalmas, M., de Vries, A.P.: Reliability Tests for the XCG and inex-2002 Metric. In: Fuhr, N., Lalmas, M., Malik, S., Szlávik, Z. (eds.) INEX 2004. LNCS, vol. 3493, pp. 60–72. Springer, Heidelberg (2005)
Kazai, G., Lalmas, M., Fuhr, N., Gövert, N.: A report on the first year of the INitiative for the evaluation of XML retrieval (INEX 2002). Journal of the American Society for Information Science and Technology (JASIST) 55(6), 551–556 (2004)
Kekäläinen, J., Järvelin, K.: Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology (JASIST) 53(13), 1120–1129 (2002)
Kraaij, W.: Variations on Language Modeling for Information Retrieval. PhD thesis, University of Twente (2004)
Mea, V.D., Mizzaro, S.: Measuring retrieval effectiveness: a new proposal and a first experimental validation. Journal of the American Society for Information Science and Technology (JASIST) 55(6), 530–543 (2004)
Myaeng, S.H., Jang, D.-H., Kim, M.-S., Zhoo, Z.-C.: A Flexible Model for Retrieval of SGML documents. In: SIGIR 1998, Melbourne, Australia, pp. 138–140 (August 1998)
Raghavan, V.V., Jung, G.S., Bollmann, P.: A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems 7(3), 205–229 (1989)
Sakai, T.: New Performance metrics based on Multigrade Relevance: Their Application to Question Answering. In: NTCIR-4 Proceedings (2004)
Sakai, T.: Ranking the NTCIR Systems Based on Multigrade Relevance. In: Myaeng, S.-H., Zhou, M., Wong, K.-F., Zhang, H.-J. (eds.) AIRS 2004. LNCS, vol. 3411, pp. 251–262. Springer, Heidelberg (2005)
Sanderson, M., Järvelin, K., Allan, J., Bruza, P. (eds.) SIGIR 2004: Proc. of the 27th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Sheffield, UK, July 25-29 (2004)
Sanderson, M., Zobel, J.: Information retrieval system evaluation: Effort, sensitivity, and reliability. In: ACM SIGIR 2005 (2005) (to appear)
Savoy, J.: Statistical inference in retrieval effectiveness evaluation. Info. Process. Management 33(4), 495–512 (1997)
Soboroff, I.: On evaluating web search with very few relevant documents. In: Sanderson, et al. (eds.) [19], pp. 530–531
Tague-Sutcliffe, J., Blustein, J.: A statistical analysis of the TREC-3 data. In: Proceedings of TREC-3, NIST Special Publication 500-225, pp. 385–398 (April 1995)
Van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
Voorhees, E.M.: The TREC robust retrieval track. SIGIR Forum 39(1), 11–20 (2005)
Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment error. In: ACM SIGIR 2002, pp. 316–323. ACM Press, New York (August 2002)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Vu, HT., Gallinari, P. (2005). On Effectiveness Measures and Relevance Functions in Ranking INEX Systems. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2