Abstract
Data quality, and label quality in particular, has a significant impact on prediction accuracy in supervised learning: training on datasets with label noise degrades the generalization performance of the learned model. To address numerical label noise in regression, we estimate the posterior distribution of the true label with a Gaussian mixture model (GMM) and derive a label noise estimator by combining maximum a posteriori (MAP) estimation with this posterior distribution. A noise filtering algorithm with MAP estimation (MAPNF) is then designed by integrating the estimator into an optimal sample selection framework. Extensive experiments on benchmark datasets and an age estimation dataset verify the effectiveness of MAPNF. On the benchmark datasets, MAPNF outperforms other recent filtering algorithms in improving the generalization performance of different regression models, including both noise-sensitive and noise-robust models, reducing the model error by 29.7% to 69.6%. The proposed approach also identifies erroneous labels in an age estimation dataset of 18,424 samples: a model trained on the filtered dataset (with 19% of the samples removed) reduces the test error by at least 2.68%. These results demonstrate a less-is-better effect, achieving lower prediction errors with fewer but higher-quality samples. We conclude that MAPNF can effectively identify label noise and improve data quality.
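The core idea can be illustrated with a short sketch: fit a Gaussian mixture to a signal of label noise and keep only the samples whose MAP assignment falls in the clean component. The sketch below is not the authors' MAPNF algorithm; using regression residuals as the noise signal, the two-component mixture, and the Ridge base regressor are assumptions made purely for illustration.

```python
# Illustrative sketch of GMM + MAP-style label-noise filtering for regression.
# NOT the authors' MAPNF algorithm: the residual-based noise signal, the
# two-component mixture, and the Ridge base model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture

def map_noise_filter(X, y, n_components=2):
    """Return a boolean mask marking samples kept as (likely) clean."""
    # 1. Fit a preliminary regressor and compute absolute residuals.
    residuals = np.abs(y - Ridge().fit(X, y).predict(X)).reshape(-1, 1)

    # 2. Model the residual distribution with a Gaussian mixture:
    #    the component with the smallest mean is treated as "clean".
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(residuals)
    clean_component = int(np.argmin(gmm.means_.ravel()))

    # 3. MAP decision: keep a sample if the clean component has the highest
    #    posterior probability given its residual.
    posterior = gmm.predict_proba(residuals)
    return posterior.argmax(axis=1) == clean_component

# Usage:
#   keep_mask = map_noise_filter(X_train, y_train)
#   X_clean, y_clean = X_train[keep_mask], y_train[keep_mask]
```

In MAPNF itself the posterior is built over the true label and the retained subset is chosen through an optimal sample selection framework, so this sketch should be read only as an analogy for the GMM-plus-MAP decision step.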
Data Availability
Data will be made available on request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62276161, U21A20513, 62076154, 61906113), and the Fundamental Research Program of Shanxi Province (202303021221055).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, G., Li, Z. & Wang, W. Maximum a posteriori estimation and filtering algorithm for numerical label noise. Appl Intell 54, 8841–8855 (2024). https://doi.org/10.1007/s10489-024-05648-y