Dependence and Model Selection in LLP: The Problem of Variants

Authors:
Gabriel Franco, Boston University, Boston, MA, USA (ORCID: 0000-0003-0702-0146)
Mark Crovella, Boston University, Boston, MA, USA (ORCID: 0000-0002-5005-7019)
Giovanni Comarela, Universidade Federal do Espírito Santo, Vitória, Brazil (ORCID: 0000-0001-7612-9650)

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2023, Pages 470–481
https://doi.org/10.1145/3580305.3599307

Published: 04 August 2023

ABSTRACT

The problem of Learning from Label Proportions (LLP) has received considerable research attention and has numerous practical applications. In LLP, a hypothesis assigning labels to items is learned using knowledge of only the proportion of labels found in predefined groups, called bags. While a number of algorithmic approaches to learning in this context have been proposed, very little work has addressed the model selection problem for LLP. Indeed, it is not obvious how to extend straightforward model selection approaches to LLP, in part because of the lack of item labels. More fundamentally, we argue that a careful approach to model selection for LLP requires consideration of the dependence structure that exists between bags, items, and labels. In this paper we formalize this structure and show how it affects model selection. This analysis leads to improved model selection methods, which we demonstrate outperform the state of the art over a wide range of datasets and LLP algorithms.
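
The LLP setting described in the abstract can be made concrete with a toy sketch. Everything here is an assumption for illustration: the data, the one-feature threshold hypothesis, and the function names `bag_proportions` and `bag_level_error` do not come from the paper. The score shown is only a naive bag-level baseline, chosen because item labels are unavailable for validation; it is not the authors' method, and it ignores the dependence between bags, items, and labels that the abstract argues matters for careful model selection.

```python
from collections import defaultdict

def bag_proportions(bag_ids, labels):
    """Fraction of positive (1) labels in each bag."""
    counts = defaultdict(lambda: [0, 0])  # bag -> [positives, size]
    for bag, label in zip(bag_ids, labels):
        counts[bag][0] += label
        counts[bag][1] += 1
    return {bag: pos / size for bag, (pos, size) in counts.items()}

def bag_level_error(bag_ids, predictions, known_proportions):
    """Mean absolute gap between predicted and known bag proportions."""
    predicted = bag_proportions(bag_ids, predictions)
    gaps = [abs(predicted[b] - known_proportions[b]) for b in known_proportions]
    return sum(gaps) / len(gaps)

# Toy data (hypothetical): one feature per item, two bags of four items.
features = [0.1, 0.4, 0.6, 0.9, 0.2, 0.3, 0.7, 0.8]
bag_ids = ["A", "A", "A", "A", "B", "B", "B", "B"]
hidden_labels = [0, 0, 1, 1, 0, 0, 1, 1]  # never shown to the learner

# In LLP the learner observes only these per-bag proportions.
known = bag_proportions(bag_ids, hidden_labels)

def threshold_hypothesis(threshold):
    """Hypothetical classifier: label an item 1 if its feature >= threshold."""
    return [1 if x >= threshold else 0 for x in features]

# Naive model selection: pick the candidate whose predicted bag
# proportions are closest to the known ones.
candidates = [0.25, 0.5, 0.75]
scores = {t: bag_level_error(bag_ids, threshold_hypothesis(t), known)
          for t in candidates}
best = min(scores, key=scores.get)
print(best, scores)
```

In this toy case the threshold 0.5 reproduces both bag proportions exactly, so it is selected. Different hypotheses can match bag proportions equally well while disagreeing on individual items, which is one reason a bag-level score alone may be a weak selection criterion.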

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Cross-validation
    2. Learning settings
      1. Semi-supervised learning settings

Published in
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN:9798400701030
DOI:10.1145/3580305
General Chairs: Ambuj Singh (UC Santa Barbara, USA), Yizhou Sun (UC Los Angeles, USA)
Program Chairs: Leman Akoglu (Carnegie Mellon University, USA), Dimitrios Gunopulos (University of Athens, Greece), Xifeng Yan (UC Santa Barbara, USA), Ravi Kumar (Google, USA), Fatma Ozcan (Google, USA), Jieping Ye (Alibaba DAMO Academy)
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
hyperparameter selection
learning from label proportions
weakly supervised learning
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%