ABSTRACT
Evaluation metrics for search and ranking systems are generally designed for a linear list of ranked items without ties. However, ties arise naturally in the ranked lists produced by certain systems or techniques. Evaluation protocols typically break such ties arbitrarily and then compute the standard metrics. If the number of ties is non-trivial, it is more principled to use modified, tie-aware formulations of these metrics. For most commonly used metrics, McSherry and Najork [5] present tie-aware definitions that are more appropriate for assessing systems that retrieve multiple distinct results at the same rank. Hit@k is a common evaluation measure that is widely used for some tasks but is not covered in [5]. This paper proposes a tie-aware version of Hit@k, which we call ta-Hit@k. We also empirically compare the values of ta-Hit@k and Hit@k for a single example system on a standard benchmark task.
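The abstract does not spell out the definition of ta-Hit@k. As a rough illustration of what a tie-aware Hit@k could look like, the sketch below assumes one natural reading: the expected value of Hit@k when ties are broken uniformly at random. This is an assumption made for illustration, not necessarily the paper's formulation. The function names `hit_at_k` and `expected_hit_at_k`, and the representation of a ranking as a list of tie groups, are hypothetical.

```python
from math import comb


def hit_at_k(ranked, relevant, k):
    """Standard Hit@k on a tie-free list: 1.0 if any relevant item is in the top k."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def expected_hit_at_k(tie_groups, relevant, k):
    """Illustrative tie-aware Hit@k (ASSUMPTION: expectation under uniformly
    random tie-breaking; not necessarily the paper's ta-Hit@k).

    tie_groups: list of lists, best-scoring group first; items within a
    group share the same rank.
    """
    seen = 0  # number of items in groups ranked strictly above the current one
    for group in tie_groups:
        if seen >= k:
            break
        n = len(group)
        r = sum(1 for item in group if item in relevant)
        slots = k - seen  # top-k positions still unfilled
        if n <= slots:
            # The whole group fits inside the top k, so tie order is irrelevant.
            if r > 0:
                return 1.0
        else:
            # The group straddles the cutoff: only `slots` of its n items make
            # the top k. Probability that at least one of the r relevant items
            # is among them is 1 - C(n - r, slots) / C(n, slots).
            return 1.0 - comb(n - r, slots) / comb(n, slots)
        seen += n
    return 0.0


# One untied item, then three items tied at the next rank; only "c" is relevant.
groups = [["a"], ["b", "c", "d"]]
relevant = {"c"}
print(hit_at_k(["a", "b", "c", "d"], relevant, 2))  # one arbitrary tie-break
print(hit_at_k(["a", "c", "b", "d"], relevant, 2))  # another arbitrary tie-break
print(expected_hit_at_k(groups, relevant, 2))       # averaged over tie-breaks
```

In this toy case, arbitrary tie-breaking yields either 0 or 1 depending on the order chosen, whereas the averaged score under the stated assumption is 1/3, which does not depend on how the tie happens to be broken.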
- Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In Proc. of 28th ACM CIKM (CIKM '19). Association for Computing Machinery, New York, NY, USA, 729--738.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of NAACL. 4171--4186.
- Denys Katerenchuk and Andrew Rosenberg. 2016. RankDCG: Rank-Ordering Evaluation Measure. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), Portorož, Slovenia, 3675--3680. https://www.aclweb.org/anthology/L16-1583
- Xiaolu Lu, Soumajit Pramanik, Rishiraj Saha Roy, Abdalghani Abujabal, Yafang Wang, and Gerhard Weikum. 2019. Answering Complex Questions by Joining Multi-Document Evidence with Quasi Knowledge Graphs. In Proc. of 42nd SIGIR (SIGIR '19). 105--114.
- Frank McSherry and Marc Najork. 2008. Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores. In Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings (Lecture Notes in Computer Science), Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White (Eds.), Vol. 4956. Springer, 414--421.
- Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, and Yiming Yang. 2020. A Re-evaluation of Knowledge Graph Completion Methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5516--5522.
Index Terms
- On modifying evaluation measures to deal with ties in ranked lists