
Incorporating User Expectations and Behavior into the Measurement of Search Effectiveness

Published: 05 June 2017

Abstract

Information retrieval systems aim to help users satisfy information needs. We argue that the goal of the person using the system, and the pattern of behavior that they exhibit as they proceed to attain that goal, should be incorporated into the methods and techniques used to evaluate the effectiveness of IR systems, so that the resulting effectiveness scores have a useful interpretation that corresponds to the users’ search experience. In particular, we investigate the role of search task complexity, and show that it has a direct bearing on the number of relevant answer documents sought by users in response to an information need, suggesting that useful effectiveness metrics must be goal sensitive. We further suggest that user behavior while scanning results listings is affected by the rate at which their goal is being realized, and hence that appropriate effectiveness metrics must be adaptive to the presence (or not) of relevant documents in the ranking. In response to these two observations, we present a new effectiveness metric, INST, that has both of the desired properties: INST employs a parameter T, a direct measure of the user’s search goal that adjusts the top-weightedness of the evaluation score; moreover, as progress towards the target T is made, the modeled user behavior is adapted, to reflect the remaining expectations. INST is experimentally compared to previous effectiveness metrics, including Average Precision (AP), Normalized Discounted Cumulative Gain (NDCG), and Rank-Biased Precision (RBP), demonstrating our claims as to INST’s usefulness. Like RBP, INST is a weighted-precision metric, meaning that each score can be accompanied by a residual that quantifies the extent of the score uncertainty caused by unjudged documents. As part of our experimentation, we use crowd-sourced data and score residuals to demonstrate that a wide range of queries arise for even quite specific information needs, and that these variant queries introduce significant levels of residual uncertainty into typical experimental evaluations. These causes of variability have wide-reaching implications for experiment design, and for the construction of test collections.
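To make the adaptive behavior described in the abstract concrete, the sketch below computes an INST-style score and its residual for a single ranking. It is a minimal illustration that assumes the continuation probability reported in the authors' earlier ADCS 2015 companion note, C(i) = ((i - 1 + T + T_i) / (i + T + T_i))^2, where T_i is the volume of relevance still being sought after rank i; the function name inst_score, the evaluation depth, and the clamping of T_i at zero are illustrative choices made here, not part of the published definition, so the full paper should be treated as authoritative.

# A minimal sketch of an INST-style adaptive weighted-precision computation.
# The continuation probability follows the ADCS 2015 formulation as recalled
# here (an assumption); gains are per-rank values in [0, 1], with None used
# to mark unjudged documents.

def inst_score(gains, T, depth=1000):
    """Return (lower_bound, residual) for a ranking under an INST-like model.

    gains : list of per-rank gains in [0, 1]; None marks an unjudged document.
    T     : the user's target volume of relevance (the search goal).
    depth : evaluation depth; the ranking is padded with non-relevant documents.
    """
    def weighted_precision(r):
        v = 1.0            # probability the user views rank i (V_1 = 1)
        seen = 0.0         # cumulative gain to and including rank i
        num = den = 0.0    # numerator / denominator of the expected rate of gain
        for i, ri in enumerate(r, start=1):
            seen += ri
            num += v * ri
            den += v
            t_i = max(T - seen, 0.0)                        # relevance still sought
            c_i = ((i - 1 + T + t_i) / (i + T + t_i)) ** 2  # continuation probability
            v *= c_i                                        # V_{i+1} = V_i * C(i)
        return num / den

    padded = list(gains) + [None] * max(0, depth - len(gains))
    lower = weighted_precision([0.0 if g is None else g for g in padded])
    upper = weighted_precision([1.0 if g is None else g for g in padded])
    return lower, upper - lower


if __name__ == "__main__":
    # Example: judged gains at the top of the ranking, one unjudged document at rank 4.
    score, residual = inst_score([1.0, 0.0, 1.0, None, 0.5], T=2.0)
    print(f"INST lower bound = {score:.4f}, residual = {residual:.4f}")

As with RBP, the residual in this sketch is obtained by scoring the ranking twice, once with unjudged documents treated as non-relevant and once with them treated as fully relevant, so that the gap quantifies the score uncertainty caused by missing judgments.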



    Published In

ACM Transactions on Information Systems, Volume 35, Issue 3 (July 2017), 410 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3026478

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 05 June 2017
    Accepted: 01 November 2016
    Revised: 01 September 2016
    Received: 01 May 2016
    Published in TOIS Volume 35, Issue 3

    Author Tags

    1. User behavior
    2. effectiveness metric
    3. query
    4. relevance measures
    5. search
    6. test collections

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Australian Research Council's Discovery Projects Scheme
