DOI: 10.1145/3409256.3409824

Training Data Optimization for Pairwise Learning to Rank

Published: 14 September 2020

Abstract

This paper studies data optimization for Learning to Rank (LtR): dropping training labels to increase ranking accuracy. Our work is inspired by data dropout, which shows that some training data do not positively influence learning and are better dropped out, despite the common belief that a larger training dataset is always beneficial. Our main contribution is to extend this intuition to noisy and semi-supervised LtR scenarios: some human annotations can be noisy or out-of-date, and so can machine-generated pseudo-labels in semi-supervised scenarios. Dropping such unreliable labels would contribute to both scenarios. State-of-the-art work proposes the Influence Function (IF) for estimating how each training instance affects learning, and we identify and overcome two challenges specific to LtR: 1) non-convex ranking functions violate the assumptions required for the robustness of IF estimation, and 2) the pairwise learning of LtR incurs quadratic estimation overhead. Our technical contributions address these challenges: first, we revise estimation and data optimization to accommodate reduced reliability; second, we devise a group-wise estimation that reduces cost while keeping accuracy high. We validate the effectiveness of our approach on a wide range of ad-hoc information retrieval benchmarks and real-life search engine datasets in both noisy and semi-supervised scenarios.
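
To make the idea concrete, below is a minimal, hedged sketch of influence-function-based data dropout for pairwise LtR. It follows the generic Koh and Liang influence formulation rather than the paper's actual method: the linear RankNet-style scorer, the damping term, and all names here are illustrative assumptions.

```python
import numpy as np

def pair_grad(w, d):
    # Gradient of the pairwise logistic loss log(1 + exp(-w.d)) w.r.t. w,
    # where d = x_preferred - x_other is the feature difference of one pair.
    return -(1.0 / (1.0 + np.exp(w @ d))) * d

def pair_hess(w, d):
    # Hessian contribution of one pair: sigma(w.d) * (1 - sigma(w.d)) * d d^T.
    p = 1.0 / (1.0 + np.exp(-(w @ d)))
    return p * (1.0 - p) * np.outer(d, d)

def pair_influences(w, train_pairs, val_pairs, damping=1e-3):
    # I(z) ~= -grad L_val(w)^T H^{-1} grad L_z(w); a positive value means
    # upweighting the pair is estimated to raise validation loss, so the
    # pair is a candidate for dropping.
    H = sum(pair_hess(w, d) for d in train_pairs) / len(train_pairs)
    H = H + damping * np.eye(w.shape[0])  # damping: a common fix when the loss is non-convex
    g_val = sum(pair_grad(w, d) for d in val_pairs) / len(val_pairs)
    h_inv_g = np.linalg.solve(H, g_val)
    return np.array([-(h_inv_g @ pair_grad(w, d)) for d in train_pairs])

# Toy usage: rank training pairs by estimated harm and drop the worst ones.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
train = [rng.normal(size=5) for _ in range(100)]
val = [rng.normal(size=5) for _ in range(20)]
drop_candidates = np.argsort(pair_influences(w, train, val))[-10:]
```

Note that this per-pair loop is exactly where the quadratic overhead mentioned in the abstract comes from: the number of preference pairs grows quadratically with the number of documents per query.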

Supplementary Material

MP4 File (3409256.3409824.mp4)
We solved a data optimization task for noisy and semi-supervised learning-to-rank scenarios. We used influence functions, proposing two methods to improve the reliability of influence estimation and another method to decrease computational cost while keeping estimation quality reasonable.
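
The cost-reduction idea can be sketched as a group-wise approximation: instead of one influence estimate per training pair, whole groups of pairs are scored at once by summing their gradients. This is a hedged illustration only; grouping by query id and the helper names are assumptions of mine, not the paper's implementation.

```python
import numpy as np

def _grad(w, d):
    # gradient of log(1 + exp(-w.d)) for one preference pair (d = x_i - x_j)
    return -(1.0 / (1.0 + np.exp(w @ d))) * d

def _hess(w, d):
    p = 1.0 / (1.0 + np.exp(-(w @ d)))
    return p * (1.0 - p) * np.outer(d, d)

def group_influences(w, pairs_by_query, val_pairs, damping=1e-3):
    all_pairs = [d for ds in pairs_by_query.values() for d in ds]
    H = sum(_hess(w, d) for d in all_pairs) / len(all_pairs)
    H = H + damping * np.eye(w.shape[0])
    g_val = sum(_grad(w, d) for d in val_pairs) / len(val_pairs)
    h_inv_g = np.linalg.solve(H, g_val)
    # One dot product per group: a group's summed gradient stands in for its
    # members, so the number of influence evaluations drops from O(#pairs)
    # to O(#groups); groups with the largest positive score are dropped first.
    return {q: float(-(h_inv_g @ sum(_grad(w, d) for d in ds)))
            for q, ds in pairs_by_query.items()}
```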

Published In

ICTIR '20: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval
September 2020
207 pages
ISBN:9781450380676
DOI:10.1145/3409256
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. influence functions
  2. learning to rank
  3. noisy data
  4. semi-supervised learning

Qualifiers

  • Research-article

Conference

ICTIR '20
Acceptance Rates

Overall Acceptance Rate 235 of 527 submissions, 45%

Cited By

  • (2025) Optimizing training data for persona-grounded dialogue via Synthetic Label Augmentation. Expert Systems with Applications 265, 125796. https://doi.org/10.1016/j.eswa.2024.125796. Online publication date: Mar-2025.
  • (2024) Finding an Optimal Small Sample of Training Dataset for Computer Vision Deep-Learning Models. 2024 International Conference on Machine Learning and Applications (ICMLA), 1439-1446. https://doi.org/10.1109/ICMLA61862.2024.00223. Online publication date: 18-Dec-2024.
  • (2022) PeerRank: Robust Learning to Rank With Peer Loss Over Noisy Labels. IEEE Access 10, 6830-6841. https://doi.org/10.1109/ACCESS.2022.3142096. Online publication date: 2022.
  • (2021) General Approximate Cross Validation for Model Selection. Proceedings of the 29th ACM International Conference on Multimedia, 5281-5289. https://doi.org/10.1145/3474085.3475649. Online publication date: 17-Oct-2021.
