Estimating average precision when judgments are incomplete

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

We consider the problem of evaluating retrieval systems with incomplete relevance judgments. Recently, Buckley and Voorhees showed that standard measures of retrieval performance are not robust to incomplete judgments, and they proposed a new measure, bpref, that is much more robust to incomplete judgments. Although bpref is highly correlated with average precision when the judgments are effectively complete, the value of bpref deviates from average precision and from its own value as the judgment set degrades, especially at very low levels of assessment. In this work, we propose three new evaluation measures, induced AP, subcollection AP, and inferred AP, that are equivalent to average precision when the relevance judgments are complete and that are statistical estimates of average precision when the relevance judgments are a random subset of complete judgments. We consider natural scenarios that yield highly incomplete judgments, such as random judgment sets or very shallow depth pools, and we compare and contrast the robustness of the three proposed measures with that of bpref in both scenarios. Using TREC data, we demonstrate that these measures are more robust to incomplete relevance judgments than bpref, both in terms of how well they estimate average precision and how well they estimate their own values, in each case as computed with complete relevance judgments. Finally, since inferred AP is the most accurate approximation to average precision and the most robust measure in the presence of incomplete judgments, we provide a detailed analysis of this measure, both in terms of its theoretical behavior and its implementation in practice.
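
To make the estimation setting concrete, the sketch below is a minimal Python illustration, with hypothetical function and variable names, of evaluating a ranked list when only a random subset of documents has been judged. It computes standard average precision under complete judgments and then an induced-AP-style estimate that simply restricts the ranking and the relevant set to the judged documents; it illustrates the problem under randomly sampled judgments rather than reproducing the paper's exact estimators.

```python
import random

def average_precision(ranking, relevant):
    """Average precision of a ranked list given the complete set of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant document
    return total / len(relevant)

def induced_ap(ranking, judged, relevant):
    """Induced-AP-style estimate: evaluate only over the judged documents."""
    judged_ranking = [d for d in ranking if d in judged]
    judged_relevant = relevant & judged
    return average_precision(judged_ranking, judged_relevant)

# Toy example: ten ranked documents, three of which are relevant.
ranking = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d4", "d7"}
print("AP with complete judgments:", average_precision(ranking, relevant))

# Judge a random 50% sample of the collection and re-estimate.
random.seed(0)
judged = {d for d in ranking if random.random() < 0.5}
print("induced-AP estimate from the sample:", induced_ap(ranking, judged, relevant))
```

This naive restriction is only the simplest way to use a random judgment sample; as the abstract notes, inferred AP goes further and serves as a statistical estimate of average precision under such sampling.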

References

  1. Allan J (2004) HARD track overview in TREC 2004: High accuracy retrieval from documents. In: Proceedings of the 13th text REtrieval conference (TREC 2004)

  2. Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm. In: Callan J, Cormack G, Clarke C, Hawking D, Smeaton A (eds) Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 393–394

  3. Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch, pooling, and system evaluation. In: Frieder O, Hammer J, Qureshi S, Seligman L (eds) Proceedings of the 12th international conference on information and knowledge management. ACM Press, pp 484–491

  4. Aslam JA, Pavlu V, Yilmaz E (2006) A statistical method for system evaluation using incomplete judgments. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, pp 541–548

  5. Buckley C (2006) ‘trec_eval’. http://trec.nist.gov/trec_eval/trec_eval.8.1.tar.gz

  6. Buckley C, Voorhees EM (2004) Retrieval evaluation with incomplete information. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 25–32

  7. Buttcher S, Clarke C, Soboroff I (2006) The TREC 2006 terabyte track. In: Proceedings of the 15th text REtrieval conference (TREC 2006)

  8. Carterette B, Allan J, Sitaraman R (2006) Minimal test collections for retrieval evaluation. In: SIGIR ’06: proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 268–275

  9. Chen SF, Goodman J (1996) An empirical study of smoothing techniques for language modeling. In: Proceedings of the 34th annual meeting of the association for computational linguistics. Morgan Kaufmann Publishers, San Francisco, pp 310–318

  10. Clarke CLA, Scholer F, Soboroff I (2005) The TREC 2005 terabyte track. In: Proceedings of the 14th text REtrieval conference (TREC 2005)

  11. Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Croft, Moffat, van Rijsbergen, Wilkinson and Zobel (1998), pp 282–289

  12. Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R, Zobel J (eds) (1998) Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York

  13. Harman D (1995) Overview of the third text REtrieval conference (TREC-3). In: Harman D (ed) Overview of the 3rd text REtrieval conference (TREC-3), pp 1–19. US Government Printing Office, Washington D.C., Gaithersburg

  14. Hawking D, Robertson S (2003) On collection size and retrieval effectiveness. Inf Retr 6(1):99–105

  15. Kagolovsky Y, Moehr JR (2003) Current status of the evaluation of information retrieval. J Med Syst 27(5):409–424

  16. Kraaij W, Over P, Smeaton A (2006) TRECVID 2006—an introduction. In: TREC video retrieval evaluation online proceedings

  17. Kukar M (2006) Quality assessment of individual classifications in machine learning and data mining. Knowl Inf Syst 9(3):364–384

  18. Raghavan V, Bollmann P, Jung GS (1989) A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans Inf Syst 7(3):205–229

  19. Tombros A, van Rijsbergen CJ (2004) Query-sensitive similarity measures for information retrieval. Knowl Inf Syst 6(5):617–642

  20. Voorhees EM (2001) Evaluation by highly relevant documents. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 74–82

  21. Voorhees EM (2002) The philosophy of information retrieval evaluation. In: CLEF ’01: revised papers from the 2nd workshop of the cross-language evaluation forum on evaluation of cross-language information retrieval systems. Springer, London, pp 355–370

  22. Voorhees EM, Harman D (1999) Overview of the 7th text REtrieval conference (TREC-7). In: Proceedings of the 7th text REtrieval conference (TREC-7), pp 1–24

  23. Yilmaz E, Aslam JA (2006) Estimating average precision with incomplete and imperfect judgments. In: Proceedings of the 15th ACM international conference on information and knowledge management. ACM Press, New York

  24. Zobel J (1998) How reliable are the results of large-scale retrieval experiments? In: Croft et al. (1998), pp 307–314

Author information

Corresponding author

Correspondence to Emine Yilmaz.

Additional information

We gratefully acknowledge the support provided by NSF grants CCF-0418390 and IIS-0534482.

About this article

Cite this article

Yilmaz, E., Aslam, J.A. Estimating average precision when judgments are incomplete. Knowl Inf Syst 16, 173–211 (2008). https://doi.org/10.1007/s10115-007-0101-7
