DOI: 10.1145/3394486.3403394
Research article

Learning with Limited Labels via Momentum Damped & Differentially Weighted Optimization

Published: 20 August 2020

ABSTRACT

As deep learning-based models are deployed more widely in search and recommender systems, system designers often face the issue of gathering large amounts of well-annotated data to train such neural models. While most user-centric systems rely on interaction signals as implicit feedback to train models, such signals are often weak proxies of user satisfaction compared to explicit judgments from users, which are prohibitively expensive to collect. In this paper, we consider the task of learning from limited labeled data, wherein we aim to jointly leverage strong supervision data (e.g., explicit judgments) along with weak supervision data (e.g., implicit feedback or labels from a related task) to train neural models.
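To make the setup concrete, here is a minimal sketch (our own illustration, not the paper's code) of how training data from the two supervision sources might be represented and mixed. The container `LabeledExample`, the flag `is_strong`, and the helper `mix_batches` with its `weak_per_strong` ratio are hypothetical names introduced purely for this example.

```python
# Minimal sketch (assumed names, not the paper's code): each training example
# records whether its label comes from strong supervision (explicit judgments)
# or weak supervision (implicit feedback / labels from a related task).
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledExample:
    features: List[float]
    label: float
    is_strong: bool  # True: explicit judgment; False: implicit/weak signal

def mix_batches(strong: List[LabeledExample],
                weak: List[LabeledExample],
                weak_per_strong: int = 4) -> List[LabeledExample]:
    """Interleave scarce strong-label examples with weak-label examples so
    that every stretch of training data exposes the model to both sources
    (the 1:4 ratio is purely illustrative)."""
    mixed: List[LabeledExample] = []
    for i, s in enumerate(strong):
        mixed.append(s)
        mixed.extend(weak[i * weak_per_strong:(i + 1) * weak_per_strong])
    return mixed
```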

We present data mixing strategies based on submodular subset selection and, additionally, propose adaptive optimization techniques that enable the model to differentiate between strong-label and weak-supervision data points. Finally, we present two case studies, (i) user satisfaction prediction in music recommendation and (ii) question-based video comprehension, and demonstrate that the proposed adaptive learning strategies are better at learning from limited labels. Our techniques and findings provide practitioners with ways of leveraging external labeled data.
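As a rough illustration of the two ingredients named above, the sketch below (our own, under assumed names; it is not the authors' released method) pairs a greedy maximizer of a facility-location coverage objective, the standard (1 - 1/e)-approximate greedy algorithm for monotone submodular functions, with a loss that weights strong and weak examples differently and a momentum update whose momentum coefficient is damped on weak-label batches. `greedy_submodular_select`, `differentially_weighted_loss`, `damped_momentum_step`, and all their parameters are hypothetical.

```python
import numpy as np

def greedy_submodular_select(sim: np.ndarray, k: int) -> list:
    """Greedily maximize the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j]; greedy selection gives a
    (1 - 1/e) approximation for monotone submodular functions."""
    n = sim.shape[0]
    selected, cover = [], np.zeros(n)
    for _ in range(min(k, n)):
        # Marginal gain of adding each candidate column j to the current set.
        gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
        gains[selected] = -np.inf  # never re-pick an already selected point
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return selected

def differentially_weighted_loss(errors: np.ndarray,
                                 is_strong: np.ndarray,
                                 weak_weight: float = 0.3) -> float:
    """Squared-error loss in which weak-label points contribute less than
    strong-label points (the weights are illustrative)."""
    weights = np.where(is_strong, 1.0, weak_weight)
    return float(np.mean(weights * errors ** 2))

def damped_momentum_step(velocity: np.ndarray, grad: np.ndarray,
                         batch_is_weak: bool, lr: float = 0.01,
                         beta: float = 0.9, damping: float = 0.5) -> np.ndarray:
    """Classical momentum update whose momentum coefficient is damped when
    the gradient comes from a predominantly weak-label batch; the caller
    then applies `params += velocity`."""
    effective_beta = beta * (damping if batch_is_weak else 1.0)
    return effective_beta * velocity - lr * grad
```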


Published in
KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486
Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
