DOI: 10.1145/3477495.3532051
Research article

Ranking Interruptus: When Truncated Rankings Are Better and How to Measure That

Published: 07 July 2022

Abstract

Most information retrieval effectiveness metrics assume that systems appending irrelevant documents at the bottom of the ranking are as effective as (or no worse than) systems with a stopping criterion that 'truncates' the ranking at the right position, avoiding the retrieval of those trailing irrelevant documents. It can be argued, however, that such truncated rankings are more useful to the end user. It is thus important to understand how to measure retrieval effectiveness in this scenario. In this paper we provide both theoretical and experimental contributions. We first define formal properties to analyze how effectiveness metrics behave when evaluating truncated rankings. Our theoretical analysis shows that de facto standard metrics do not satisfy desirable properties for evaluating truncated rankings: only Observational Information Effectiveness (OIE) -- a metric based on Shannon's information theory -- satisfies them all. We then perform experiments to compare several metrics on nine TREC datasets. According to our experimental results, the most appropriate metrics for truncated rankings are OIE and a novel extension of Rank-Biased Precision that adds a user effort factor penalizing the retrieval of irrelevant documents.
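
As a concrete illustration of the problem the abstract describes, the sketch below (an illustration only, using standard Rank-Biased Precision plus an assumed, hypothetical per-document effort cost, not the paper's actual formulation) shows that standard RBP assigns identical scores to a truncated ranking and to the same ranking padded with irrelevant documents, whereas an effort-penalizing variant scores the padded ranking lower:

```python
# Illustrative sketch (not the paper's metric): compare standard RBP with a
# hypothetical effort-penalized variant on a truncated vs. a padded ranking.

def rbp(relevance, p=0.8):
    """Standard Rank-Biased Precision: (1 - p) * sum over ranks k of r_k * p^(k-1)."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevance))

def rbp_with_effort(relevance, p=0.8, effort_cost=0.01):
    """Hypothetical variant: subtract a small cost for every irrelevant document
    the system retrieves. This is only a stand-in for the idea of a 'user effort
    factor'; the extension proposed in the paper may differ."""
    penalty = effort_cost * sum(1 for r in relevance if r == 0)
    return rbp(relevance, p) - penalty

truncated = [1, 1, 0, 1]        # system stops after four documents
padded = truncated + [0] * 6    # same ranking plus six trailing irrelevant documents

print(rbp(truncated), rbp(padded))                          # identical (approx. 0.4624 each)
print(rbp_with_effort(truncated), rbp_with_effort(padded))  # approx. 0.4524 vs. 0.3924
```

Because the trailing documents contribute zero gain, standard RBP cannot distinguish the two runs; only a metric with some notion of retrieval cost can reward the system that stopped at the right point.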

    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. evaluation
    2. evaluation measures
    3. information retrieval
    4. ranking cutoff

    Qualifiers

    • Research-article

    Funding Sources

    • Spanish Ministry of Economic Affairs and Digital Transformation
    • Australian Research Council Centre of Excellence for Automated Decision-Making and Society
    • Australian Research Council

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

• (2024) Walert: Putting Conversational Information Seeking Knowledge into Action by Building and Evaluating a Large Language Model-Powered Chatbot. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, 401-405. https://doi.org/10.1145/3627508.3638309. Online publication date: 10-Mar-2024.
• (2024) Top-Personalized-K Recommendation. Proceedings of the ACM Web Conference 2024, 3388-3399. https://doi.org/10.1145/3589334.3645417. Online publication date: 13-May-2024.
• (2024) MileCut: A Multi-view Truncation Framework for Legal Case Retrieval. Proceedings of the ACM Web Conference 2024, 1341-1349. https://doi.org/10.1145/3589334.3645349. Online publication date: 13-May-2024.
• (2023) Joint upper & expected value normalization for evaluation of retrieval systems. Information Processing and Management 60(4). https://doi.org/10.1016/j.ipm.2023.103404. Online publication date: 1-Jul-2023.
