
Semi-Supervised Ensemble Learning for Dealing with Inaccurate and Incomplete Supervision

Published: 22 October 2021

Abstract

In real-world tasks, obtaining a large set of noise-free labelled data can be prohibitively expensive. Recent research therefore aims to enable machine learning to work with weakly supervised datasets, i.e., data that are inaccurately or incompletely labelled. However, the previous literature treats each type of weak supervision in isolation, even though, in most cases, different types of weak supervision occur simultaneously. In this article, we present Smart MEnDR, a classification Model that applies Ensemble learning and Data-driven Rectification to deal with inaccurately and incompletely supervised datasets. The model first applies a preliminary ensemble-learning phase that detects noisy data points while exploiting the unlabelled data; this phase employs a semi-supervised technique with maximum likelihood estimation to decide on the disagreement rate. Second, the proposed approach applies an iterative meta-learning step to determine which points should be corrected to improve the performance of the final classifier. To evaluate the proposed framework, we report its classification performance, noise detection, and labelling accuracy against state-of-the-art techniques. The experimental results demonstrate the effectiveness of the proposed framework in detecting noise, providing correct labels, and attaining high classification performance.
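As an illustrative sketch only (not the paper's actual Smart MEnDR algorithm), the disagreement-based noise detection the abstract describes can be approximated with a small ensemble whose out-of-fold predictions vote on each training label: a point whose given label disagrees with a majority of ensemble members is flagged as potentially mislabelled. The choice of base learners, the injected noise level, and the 0.5 disagreement threshold below are all assumptions for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

# Toy dataset with 10% of labels deliberately flipped to simulate
# inaccurate supervision.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
noise_idx = rng.choice(len(y), size=50, replace=False)
y_noisy = y.copy()
y_noisy[noise_idx] = 1 - y_noisy[noise_idx]

# Each ensemble member votes via out-of-fold predictions, so no point is
# judged by a model that trained on its (possibly wrong) label.
members = [
    RandomForestClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]
votes = np.stack([cross_val_predict(m, X, y_noisy, cv=5) for m in members])

# Disagreement rate: fraction of members whose prediction contradicts
# the given label. A majority threshold flags the point as noisy.
disagreement = (votes != y_noisy).mean(axis=0)
flagged = disagreement > 0.5

print("points flagged as noisy:", int(flagged.sum()))
print("fraction of flagged points that are truly noisy:",
      float(np.isin(np.where(flagged)[0], noise_idx).mean()))
```

A full approach in the spirit of the paper would additionally exploit the unlabelled data when training the ensemble and iterate a correction step, rather than stopping at a one-shot filter.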


Cited By

  • (2024) Towards Efficient Fine-Tuning of Language Models With Organizational Data for Automated Software Review. IEEE Transactions on Software Engineering 50, 9 (Sep. 2024), 2240–2253. DOI: 10.1109/TSE.2024.3428324
  • (2024) Joint task semi-supervised semantic segmentation for TRUS image. Biomedical Signal Processing and Control 88 (Feb. 2024), 105654. DOI: 10.1016/j.bspc.2023.105654
  • (2024) Real-Time Facial Emotion Recognition Using Haar-Cascade Classifier and MobileNet in Smart Cities. Engineering Solutions Toward Sustainable Development (17 Jan. 2024), 545–553. DOI: 10.1007/978-3-031-46491-1_33
  • (2024) Utilizing Deep Reinforcement Learning for Resource Scheduling in Virtualized Clouds. Engineering Solutions Toward Sustainable Development (17 Jan. 2024), 471–484. DOI: 10.1007/978-3-031-46491-1_28
  • (2023) Weakly supervised machine learning. CAAI Transactions on Intelligence Technology (28 Apr. 2023). DOI: 10.1049/cit2.12216

Published In
ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 3
June 2022
494 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3485152

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2021
Accepted: 01 July 2021
Revised: 01 May 2021
Received: 01 May 2020
Published in TKDD Volume 16, Issue 3


Author Tags

  1. Probabilistic algorithms
  2. machine learning
  3. classification
  4. semi-supervised learning

Qualifiers

  • Research-article
  • Refereed

