Assessing Top-k Preferences

Published: 05 May 2021

Abstract

Assessors make preference judgments faster and more consistently than graded judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for graded judgments, such as NDCG. In this article, we explore the assessment process for partial preference judgments, with the aim of identifying and ordering the top items in the pool, rather than fully ordering the entire pool. To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure named compatibility. This new measure has its most striking impact when comparing modern neural rankers, where it is able to recognize significant improvements in quality that would otherwise be missed by NDCG.
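To make the evaluation step concrete, here is a minimal Python sketch of the kind of comparison the abstract describes; it is an illustration under stated assumptions, not the authors' reference implementation. It assumes the preferred ordering of the top items has already been derived from the preference judgments, and it scores a system ranking by truncated rank-biased overlap (RBO) with that ordering, normalized so that the preferred ordering itself scores 1. The persistence parameter p, the truncation depth, and the function names are illustrative choices.

    # Minimal sketch: compatibility as normalized, truncated rank-biased overlap
    # against a preferred ordering of the top items (assumed already derived
    # from the preference judgments). p and depth are illustrative defaults.

    def rbo(run, ideal, p=0.95, depth=1000):
        """Truncated RBO: (1 - p) * sum over d of p^(d-1) * |prefix overlap at d| / d."""
        run_seen, ideal_seen = set(), set()
        score = 0.0
        for d in range(1, depth + 1):
            if d <= len(run):
                run_seen.add(run[d - 1])
            if d <= len(ideal):
                ideal_seen.add(ideal[d - 1])
            # Agreement between the two rankings down to depth d.
            score += (1 - p) * (p ** (d - 1)) * len(run_seen & ideal_seen) / d
        return score

    def compatibility(run, ideal, p=0.95, depth=1000):
        """Similarity of a system ranking to the preferred ordering, scaled so the preferred ordering scores 1."""
        best = rbo(ideal, ideal, p, depth)
        return rbo(run, ideal, p, depth) / best if best > 0 else 0.0

    # Example: a preferred ordering of the top three items and one system ranking.
    ideal = ["d3", "d1", "d7"]
    run = ["d1", "d3", "d9", "d7"]
    print(round(compatibility(run, ideal), 3))

Because RBO is top-weighted, agreement near the top of the ranking dominates the score, which matches the goal of identifying and ordering only the top items in the pool rather than fully ordering it.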



    • Published in

      ACM Transactions on Information Systems, Volume 39, Issue 3
      July 2021
      432 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      DOI: 10.1145/3450607

      Copyright © 2021 held by the owner/author(s). Publication rights licensed to ACM.

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 May 2021
      • Revised: 1 February 2021
      • Accepted: 1 February 2021
      • Received: 1 July 2020
      Published in TOIS Volume 39, Issue 3


      Qualifiers

      • research-article
      • Refereed
