Abstract
Learning-to-rank (LtR) algorithms for information retrieval use the supervised learning framework to learn a ranking function from a training set consisting of query-document pairs. In this study we investigate the imbalanced nature of LtR training sets, which generally contain very few relevant documents compared to the number of irrelevant ones. The need to include as many relevant documents as possible in the training set is well known, but we ask how many irrelevant documents are needed to learn a good ranking function. We employ both random and deterministic undersampling techniques to reduce the number of irrelevant documents. Reducing the training set size shortens training time, which is an important factor in large-scale LtR. Extensive experiments on the LETOR benchmark datasets reveal that the performance of an LtR algorithm trained on a much smaller training set remains similar to that achieved with the original training set. This study thus suggests that for large-scale LtR tasks, we can leverage undersampling techniques to reduce training time with negligible effect on performance.
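To make the random undersampling idea concrete, the following is a minimal sketch (not the paper's implementation): it assumes a LETOR-style training set represented as (query id, feature vector, relevance label) tuples, keeps every relevant document, and retains only a fraction of the irrelevant ones per query. The function name `undersample_irrelevant` and the data layout are illustrative assumptions.

```python
import random
from collections import defaultdict

def undersample_irrelevant(samples, keep_ratio, seed=0):
    """Randomly drop irrelevant documents on a per-query basis.

    samples: list of (qid, features, label) tuples, where label 0
             marks an irrelevant document and label > 0 a relevant one.
    keep_ratio: fraction of irrelevant documents to retain per query.
    """
    rng = random.Random(seed)

    # Group documents by query, since balance matters per query list.
    by_query = defaultdict(list)
    for qid, feats, label in samples:
        by_query[qid].append((feats, label))

    reduced = []
    for qid, docs in by_query.items():
        relevant = [(f, l) for f, l in docs if l > 0]
        irrelevant = [(f, l) for f, l in docs if l == 0]

        # Keep all relevant documents; subsample only the irrelevant ones.
        n_keep = max(1, int(len(irrelevant) * keep_ratio))
        kept = rng.sample(irrelevant, min(n_keep, len(irrelevant)))
        reduced.extend((qid, f, l) for f, l in relevant + kept)
    return reduced
```

For example, with `keep_ratio=0.2` a query list dominated by irrelevant documents shrinks roughly fivefold, which directly cuts training time; the paper's finding is that such reductions leave ranking quality largely unchanged. A deterministic variant would replace the random sample with a fixed selection rule (e.g., the highest-scoring irrelevant documents under some baseline ranker) rather than `rng.sample`.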