DOI: 10.1145/3409256.3409824

Training Data Optimization for Pairwise Learning to Rank

Published: 14 September 2020

Abstract

This paper studies data optimization for Learning to Rank (LtR): dropping training labels to increase ranking accuracy. Our work is inspired by data dropout, which shows that some training data do not positively influence learning and are better dropped out, despite the common belief that a larger training dataset is always beneficial. Our main contribution is to extend this intuition to noisy and semi-supervised LtR scenarios: some human annotations can be noisy or out-of-date, and so can machine-generated pseudo-labels in semi-supervised scenarios. Dropping such unreliable labels would contribute to both scenarios. State-of-the-art work proposes the Influence Function (IF) for estimating how each training instance affects learning, and we identify and overcome two challenges specific to LtR: 1) non-convex ranking functions violate the assumptions required for the robustness of IF estimation, and 2) the pairwise learning of LtR incurs quadratic estimation overhead. Our technical contributions address these challenges: first, we revise estimation and data optimization to accommodate reduced reliability; second, we devise a group-wise estimation that reduces cost while keeping accuracy high. We validate the effectiveness of our approach on a wide range of ad-hoc information retrieval benchmarks and real-life search engine datasets in both noisy and semi-supervised scenarios.
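
To make the idea concrete, below is a minimal, hedged sketch of influence-function-based data dropout for pairwise LtR. It follows the generic Koh and Liang influence formulation rather than the paper's actual method: the linear RankNet-style scorer, the damping term, and all names here are illustrative assumptions.

```python
import numpy as np

def pair_grad(w, d):
    # Gradient of the pairwise logistic loss log(1 + exp(-w.d)) w.r.t. w,
    # where d = x_preferred - x_other is the feature difference of one pair.
    return -(1.0 / (1.0 + np.exp(w @ d))) * d

def pair_hess(w, d):
    # Hessian contribution of one pair: sigma(w.d) * (1 - sigma(w.d)) * d d^T.
    p = 1.0 / (1.0 + np.exp(-(w @ d)))
    return p * (1.0 - p) * np.outer(d, d)

def pair_influences(w, train_pairs, val_pairs, damping=1e-3):
    # I(z) ~= -grad L_val(w)^T H^{-1} grad L_z(w); a positive value means
    # upweighting the pair is estimated to raise validation loss, so the
    # pair is a candidate for dropping.
    H = sum(pair_hess(w, d) for d in train_pairs) / len(train_pairs)
    H = H + damping * np.eye(w.shape[0])  # damping: a common fix when the loss is non-convex
    g_val = sum(pair_grad(w, d) for d in val_pairs) / len(val_pairs)
    h_inv_g = np.linalg.solve(H, g_val)
    return np.array([-(h_inv_g @ pair_grad(w, d)) for d in train_pairs])

# Toy usage: rank training pairs by estimated harm and drop the worst ones.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
train = [rng.normal(size=5) for _ in range(100)]
val = [rng.normal(size=5) for _ in range(20)]
drop_candidates = np.argsort(pair_influences(w, train, val))[-10:]
```

Note that this per-pair loop is exactly where the quadratic overhead mentioned in the abstract comes from: the number of preference pairs grows quadratically with the number of documents per query.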

Supplementary Material

MP4 File (3409256.3409824.mp4)
We solved a data optimization task for noisy and semi-supervised learning-to-rank scenarios. We used influence functions, proposing two methods to improve the reliability of influence estimation and another method to decrease computational cost while keeping estimation quality reasonable.
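
The cost-reduction idea can be sketched as a group-wise approximation: instead of one influence estimate per training pair, whole groups of pairs are scored at once by summing their gradients. This is a hedged illustration only; grouping by query id and the helper names are assumptions of mine, not the paper's implementation.

```python
import numpy as np

def _grad(w, d):
    # gradient of log(1 + exp(-w.d)) for one preference pair (d = x_i - x_j)
    return -(1.0 / (1.0 + np.exp(w @ d))) * d

def _hess(w, d):
    p = 1.0 / (1.0 + np.exp(-(w @ d)))
    return p * (1.0 - p) * np.outer(d, d)

def group_influences(w, pairs_by_query, val_pairs, damping=1e-3):
    all_pairs = [d for ds in pairs_by_query.values() for d in ds]
    H = sum(_hess(w, d) for d in all_pairs) / len(all_pairs)
    H = H + damping * np.eye(w.shape[0])
    g_val = sum(_grad(w, d) for d in val_pairs) / len(val_pairs)
    h_inv_g = np.linalg.solve(H, g_val)
    # One dot product per group: a group's summed gradient stands in for its
    # members, so the number of influence evaluations drops from O(#pairs)
    # to O(#groups); groups with the largest positive score are dropped first.
    return {q: float(-(h_inv_g @ sum(_grad(w, d) for d in ds)))
            for q, ds in pairs_by_query.items()}
```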

Published In

ICTIR '20: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval
September 2020
207 pages
ISBN:9781450380676
DOI:10.1145/3409256
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. influence functions
  2. learning to rank
  3. noisy data
  4. semi-supervised learning

Qualifiers

  • Research-article

Conference

ICTIR '20
Acceptance Rates

Overall Acceptance Rate 235 of 527 submissions, 45%

Cited By

  • (2025) Optimizing training data for persona-grounded dialogue via Synthetic Label Augmentation. Expert Systems with Applications 265, 125796. https://doi.org/10.1016/j.eswa.2024.125796. Online publication date: Mar-2025.
  • (2024) Finding an Optimal Small Sample of Training Dataset for Computer Vision Deep-Learning Models. 2024 International Conference on Machine Learning and Applications (ICMLA), 1439-1446. https://doi.org/10.1109/ICMLA61862.2024.00223. Online publication date: 18-Dec-2024.
  • (2022) PeerRank: Robust Learning to Rank With Peer Loss Over Noisy Labels. IEEE Access 10, 6830-6841. https://doi.org/10.1109/ACCESS.2022.3142096. Online publication date: 2022.
  • (2021) General Approximate Cross Validation for Model Selection. Proceedings of the 29th ACM International Conference on Multimedia, 5281-5289. https://doi.org/10.1145/3474085.3475649. Online publication date: 17-Oct-2021.
