Towards Nuanced System Evaluation Based on Implicit User Expectations

  • Conference paper
Information Retrieval Technology (AIRS 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)

Abstract

Information retrieval systems are often evaluated through the use of effectiveness metrics. In the past, the metrics used have corresponded to fixed models of user behavior, presuming, for example, that the user will view a pre-determined number of items in the search engine results page, or that they have a constant probability of advancing from one item in the result page to the next. Recently, a number of proposals for models of user behavior have emerged that are parameterized in terms of the number of relevant documents (or other material) a user expects to be required to address their information need. That recent work has demonstrated that T, the user’s a priori utility expectation, is correlated with the underlying nature of the information need; and hence that evaluation metrics should be sensitive to T. Here we examine the relationship between the query the user issues, and their anticipated T, seeking syntactic and other clues to guide the subsequent system evaluation. That is, we wish to develop mechanisms that, based on the query alone, can be used to adjust system evaluations so that the experience of the user of the system is better captured in the system’s effectiveness score, and hence can be used as a more refined way of comparing systems. This paper reports on a first round of experimentation, and describes the progress (albeit modest) that we have achieved towards that goal.
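
The two fixed user models mentioned in the abstract can be illustrated with a short sketch. This is not the authors' code: the function names, the example ranking, and the parameter values are assumptions made for illustration. Precision at depth k stands in for the "pre-determined number of items" model, and rank-biased precision for the "constant probability of advancing" model.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fixed-depth model: the user inspects exactly the top k items."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Ranks beyond the end of the list contribute no relevance.
    return sum(relevance[:k]) / k


def rank_biased_precision(relevance: Sequence[float], p: float) -> float:
    """Constant-persistence model: after viewing an item, the user
    advances to the next one with fixed probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    weight = 1.0 - p  # weight attached to rank 1
    score = 0.0
    for r in relevance:
        score += weight * r
        weight *= p  # each deeper rank is discounted by a factor of p
    return score


# Hypothetical ranking: 1 = relevant, 0 = not relevant.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking, 3))
print(rank_biased_precision(ranking, 0.5))
```

In both functions the behavioral parameter (k or p) is fixed in advance and does not depend on the query. A T-sensitive evaluation of the kind the abstract argues for would instead let the user model vary with the anticipated number of relevant documents.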

Notes

  1. http://trec.nist.gov.

  2. The backstories are available for reuse at DOI 10.4225/08/55D0B6A098248.

  3. http://www.crowdflower.com.

  4. Cumulative logistic regression (also known as ordinal regression) used R’s ordinal::clm and ordinal::step.clm functions.

Acknowledgments

This work was supported by the Australian Research Council’s Discovery Projects Scheme (projects DP110101934 and DP140102655). We thank Xiaolu Lu for assistance with the data collection and Bodo von Billerbeck for assistance with query log mining.

Author information

Corresponding author

Correspondence to Alistair Moffat.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Thomas, P., Bailey, P., Moffat, A., Scholer, F. (2015). Towards Nuanced System Evaluation Based on Implicit User Expectations. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_26

  • DOI: https://doi.org/10.1007/978-3-319-28940-3_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28939-7

  • Online ISBN: 978-3-319-28940-3

  • eBook Packages: Computer Science (R0)
