ABSTRACT
In this paper, we propose a novel top-k learning-to-rank framework comprising a labeling strategy, a ranking model, and evaluation measures. The motivation comes from the difficulty of obtaining reliable relevance judgments from human assessors when applying learning to rank in real search systems. The traditional absolute relevance judgment method is difficult both in specifying gradations and in human assessing, resulting in a high level of disagreement among judgments. Pairwise preference judgment, a good alternative, is often criticized for increasing the judgment complexity from O(n) to O(n log n). Since users mainly care about top-ranked search results, we propose a novel top-k labeling strategy that uses pairwise preference judgments to produce an ordering of the top k of n documents (i.e., the top-k ground truth) in a manner similar to HeapSort. As a result, the judgment complexity is reduced to O(n log k). With the top-k ground truth, traditional ranking models (e.g., pairwise or listwise models) and evaluation measures (e.g., NDCG) no longer fit the data. We therefore introduce a new ranking model, FocusedRank, which fully captures the characteristics of the top-k ground truth. We also extend the widely used evaluation measures NDCG and ERR to be applicable to the top-k ground truth, referred to as κ-NDCG and κ-ERR, respectively. Finally, we conduct extensive experiments on benchmark data collections to demonstrate the efficiency and effectiveness of our top-k labeling strategy and ranking models.
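To make the complexity argument concrete, below is a minimal sketch of a HeapSort-style top-k selection driven purely by pairwise preference judgments. It is an illustration under our own assumptions rather than the paper's exact labeling procedure: the hypothetical oracle prefer(a, b) stands in for a human assessor judging that document a should rank above document b, and ties are assumed away.

```python
import heapq
from functools import cmp_to_key

def top_k_preference_labeling(docs, prefer, k):
    """Select and order the top-k documents from `docs` using only
    pairwise preference judgments (a sketch of a HeapSort-like
    top-k labeling strategy; `prefer` is a hypothetical oracle).

    A min-heap of size k is maintained over the n documents, so each
    document costs O(log k) comparisons and the total judgment cost
    is O(n log k), versus O(n log n) for fully sorting all documents.
    """
    # heapq is a min-heap, so order wrapped items such that the
    # *least preferred* of the current top-k sits at the root.
    key = cmp_to_key(lambda a, b: -1 if prefer(b, a) else 1)

    heap = [key(d) for d in docs[:k]]
    heapq.heapify(heap)
    for d in docs[k:]:
        # One judgment against the current worst of the top-k; replace
        # the root (an O(log k) sift) only if d is preferred over it.
        if prefer(d, heap[0].obj):
            heapq.heapreplace(heap, key(d))

    # Order the k survivors from most to least preferred: O(k log k).
    return [w.obj for w in sorted(heap, reverse=True)]

# Example: rank integers where "preferred" means larger.
docs = [3, 9, 1, 7, 5, 8, 2]
print(top_k_preference_labeling(docs, lambda a, b: a > b, k=3))  # [9, 8, 7]
```

Keeping the heap at size k, rather than heapifying all n documents and extracting k, is what yields the O(n log k) bound: each of the n documents triggers at most one root comparison plus an O(log k) sift, and ordering the k survivors adds only O(k log k).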