DOI: 10.1145/3394486.3403394
Research article

Learning with Limited Labels via Momentum Damped & Differentially Weighted Optimization

Published: 20 August 2020

ABSTRACT

As deep learning-based models are deployed more widely in search and recommender systems, system designers often face the issue of gathering large amounts of well-annotated data to train such neural models. While most user-centric systems rely on interaction signals as implicit feedback to train models, such signals are often weak proxies of user satisfaction compared to explicit judgments from users, which are prohibitively expensive to collect. In this paper, we consider the task of learning from limited labeled data, wherein we aim to jointly leverage strong supervision data (e.g., explicit judgments) along with weak supervision data (e.g., implicit feedback or labels from a related task) to train neural models.
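To make the setup concrete, here is a minimal sketch (our own illustration, not the paper's code) of how training data from the two supervision sources might be represented and mixed. The container `LabeledExample`, the flag `is_strong`, and the helper `mix_batches` with its `weak_per_strong` ratio are hypothetical names introduced purely for this example.

```python
# Minimal sketch (assumed names, not the paper's code): each training example
# records whether its label comes from strong supervision (explicit judgments)
# or weak supervision (implicit feedback / labels from a related task).
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledExample:
    features: List[float]
    label: float
    is_strong: bool  # True: explicit judgment; False: implicit/weak signal

def mix_batches(strong: List[LabeledExample],
                weak: List[LabeledExample],
                weak_per_strong: int = 4) -> List[LabeledExample]:
    """Interleave scarce strong-label examples with weak-label examples so
    that every stretch of training data exposes the model to both sources
    (the 1:4 ratio is purely illustrative)."""
    mixed: List[LabeledExample] = []
    for i, s in enumerate(strong):
        mixed.append(s)
        mixed.extend(weak[i * weak_per_strong:(i + 1) * weak_per_strong])
    return mixed
```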

We present data mixing strategies based on submodular subset selection and, additionally, propose adaptive optimization techniques that enable the model to differentiate between strong-label and weak-supervision data points. Finally, we present two case studies, (i) user satisfaction prediction in music recommendation and (ii) question-based video comprehension, and demonstrate that the proposed adaptive learning strategies are better at learning from limited labels. Our techniques and findings provide practitioners with ways of leveraging external labeled data.
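As a rough illustration of the two ingredients named above, the sketch below (our own, under assumed names; it is not the authors' released method) pairs a greedy maximizer of a facility-location coverage objective, the standard (1 - 1/e)-approximate greedy algorithm for monotone submodular functions, with a loss that weights strong and weak examples differently and a momentum update whose momentum coefficient is damped on weak-label batches. `greedy_submodular_select`, `differentially_weighted_loss`, `damped_momentum_step`, and all their parameters are hypothetical.

```python
import numpy as np

def greedy_submodular_select(sim: np.ndarray, k: int) -> list:
    """Greedily maximize the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j]; greedy selection gives a
    (1 - 1/e) approximation for monotone submodular functions."""
    n = sim.shape[0]
    selected, cover = [], np.zeros(n)
    for _ in range(min(k, n)):
        # Marginal gain of adding each candidate column j to the current set.
        gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
        gains[selected] = -np.inf  # never re-pick an already selected point
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return selected

def differentially_weighted_loss(errors: np.ndarray,
                                 is_strong: np.ndarray,
                                 weak_weight: float = 0.3) -> float:
    """Squared-error loss in which weak-label points contribute less than
    strong-label points (the weights are illustrative)."""
    weights = np.where(is_strong, 1.0, weak_weight)
    return float(np.mean(weights * errors ** 2))

def damped_momentum_step(velocity: np.ndarray, grad: np.ndarray,
                         batch_is_weak: bool, lr: float = 0.01,
                         beta: float = 0.9, damping: float = 0.5) -> np.ndarray:
    """Classical momentum update whose momentum coefficient is damped when
    the gradient comes from a predominantly weak-label batch; the caller
    then applies `params += velocity`."""
    effective_beta = beta * (damping if batch_is_weak else 1.0)
    return effective_beta * velocity - lr * grad
```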


Published in
KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486
Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
