ABSTRACT
As deep learning-based models are deployed more widely in search and recommender systems, system designers often face the challenge of gathering large amounts of well-annotated data to train such neural models. While most user-centric systems rely on interaction signals as implicit feedback for training, such signals are often weak proxies of user satisfaction compared to explicit judgments from users, which are prohibitively expensive to collect. In this paper, we consider the task of learning from limited labeled data, wherein we aim to jointly leverage strong supervision data (e.g., explicit judgments) and weak supervision data (e.g., implicit feedback or labels from a related task) to train neural models.
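As a concrete illustration of this joint setup, the sketch below shows one way a training step might combine the two sources, scaling down the loss on weak-supervision examples by a fixed factor. This is a minimal sketch: the model, the weight `alpha`, and the batch layout are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixed_supervision_loss(model, strong_batch, weak_batch, alpha=0.3):
    """Combine losses from strongly and weakly labeled batches.

    Weak labels (e.g. implicit feedback) are noisy proxies of the target,
    so their loss is scaled down by a fixed factor ``alpha``.
    """
    xs, ys = strong_batch  # explicit judgments
    xw, yw = weak_batch    # implicit-feedback labels
    loss_strong = F.cross_entropy(model(xs), ys)
    loss_weak = F.cross_entropy(model(xw), yw)
    return loss_strong + alpha * loss_weak

# Toy usage: a linear classifier over 16-dim features with 3 classes.
model = nn.Linear(16, 3)
strong = (torch.randn(8, 16), torch.randint(0, 3, (8,)))
weak = (torch.randn(32, 16), torch.randint(0, 3, (32,)))
loss = mixed_supervision_loss(model, strong, weak)
loss.backward()
```

A fixed `alpha` is the simplest instance of differential weighting; the adaptive optimization techniques described below would instead adjust how weak examples influence training.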
We present data mixing strategies based on submodular subset selection and propose adaptive optimization techniques that enable the model to differentiate between strongly labeled and weakly labeled data points. Finally, we present two case studies, (i) user satisfaction prediction in music recommendation and (ii) question-based video comprehension, and demonstrate that the proposed adaptive learning strategies are better at learning from limited labels. Our techniques and findings provide practitioners with ways of leveraging external labeled data when training neural models in limited-label settings.
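For the data mixing step, one standard instantiation of submodular subset selection is greedy maximization of a facility-location objective over a similarity matrix. The sketch below is a generic illustration under that assumption, not necessarily the paper's exact objective; the inputs `sim` and `k` are hypothetical.

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list[int]:
    """Greedily pick k column indices maximizing the facility-location
    objective f(S) = sum_i max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    selected: list[int] = []
    cover = np.zeros(n)  # best similarity of each point to the selected set
    for _ in range(k):
        # Marginal gain of each candidate: how much it improves coverage.
        gains = np.maximum(sim - cover[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never re-select a chosen point
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return selected

# Toy usage: cosine similarities over random embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T
subset = greedy_facility_location(sim, k=10)
```

Because facility location is monotone submodular, greedy selection enjoys the classic (1 - 1/e) approximation guarantee, which makes it a practical way to pick a representative subset of weak-supervision data to mix with the strong labels.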