Skip to main content

On the Impact of Noisy Labels on Supervised Classification Models

  • Conference paper
  • First Online:
Computational Science – ICCS 2023 (ICCS 2023)

Abstract

The amount of data generated daily grows tremendously in virtually all domains of science and industry, and its efficient storage, processing and analysis pose significant practical challenges nowadays. To automate the process of extracting useful insights from raw data, numerous supervised machine learning algorithms have been researched so far. They benefit from annotated training sets which are fed to the training routine which elaborates a model that is further deployed for a specific task. The process of capturing real-world data may lead to acquring noisy observations, ultimately affecting the models trained from such data. The impact of the label noise is, however, under-researched, and the robustness of classic learners against such noise remains unclear. We tackle this research gap and not only thoroughly investigate the classification capabilities of an array of widely-adopted machine learning models over a variety of contamination scenarios, but also suggest new metrics that could be utilized to quantify such models’ robustness. Our extensive computational experiments shed more light on the impact of training set contamination on the operational behavior of supervised learners.

AMW was supported by the Silesian University of Technology, Faculty of Biomedical Engineering grant (07/010/BK_23/1023). JN was supported by the Silesian University of Technology Rector’s grant (02/080/RGJ22/0026).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 89.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Awasthi, P., Balcan, M.F., Haghtalab, N., Urner, R.: Efficient learning of linear separators under bounded noise (2015)

    Google Scholar 

  2. Balcan, M.F., Haghtalab, N.: Noise in classification (2020)

    Google Scholar 

  3. Beinecke, J., Heider, D.: Gaussian noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making. BioData Min. 14(1), 49 (2021)

    Article  Google Scholar 

  4. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)

    Google Scholar 

  5. Chicco, D., Jurman, G.: The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020)

    Google Scholar 

  6. Dhar, S., Guo, J., Liu, J.J., Tripathi, S., Kurup, U., Shah, M.: A survey of on-device machine learning: an algorithms and learning theory perspective. ACM Trans. Internet Things 2(3), 3450494 (2021)

    Google Scholar 

  7. Duarte, J.M., Berton, L.: A review of semi-supervised learning for text classification. Artif. Intell. Rev. 56, 1–69 (2023). https://doi.org/10.1007/s10462-023-10393-8

  8. Es-sakali, N., Cherkaoui, M., Mghazli, M.O., Naimi, Z.: Review of predictive maintenance algorithms applied to HVAC systems. Energy Rep. 8, 1003–1012 (2022)

    Article  Google Scholar 

  9. Frenay, B., Verleysen, M.: Classification in the presence of label noise: a survey. IEEE TNNLS 25(5), 845–869 (2014)

    MATH  Google Scholar 

  10. Gupta, S., Gupta, A.: Dealing with noise problem in machine learning data-sets: a systematic review. Procedia Comput. Sci. 161, 466–474 (2019)

    Article  Google Scholar 

  11. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of IEEE WCCI, pp. 1322–1328 (2008)

    Google Scholar 

  12. Kawulok, M., Nalepa, J.: Towards robust SVM training from weakly labeled large data sets. In: Proceedings of IAPR ACPR, pp. 464–468 (2015)

    Google Scholar 

  13. Kotowski, K., Kucharski, D., et al.: Detecting liver cirrhosis in computed tomography scans using clinically-inspired and radiomic features. Comput. Biol. Med. 152, 106378 (2023)

    Article  Google Scholar 

  14. Leung, T., Song, Y., Zhang, J.: Handling label noise in video classification via multiple instance learning. In: Proceedings of IEEE ICCV, pp. 2056–2063 (2011)

    Google Scholar 

  15. Nalepa, J., Kotowski, K., et al.: Deep learning automates bidimensional and volumetric tumor burden measurement from MRI in pre- and post-operative glioblastoma patients. Comput. Biol. Med. 154, 106603 (2023)

    Article  Google Scholar 

  16. Nalepa, J., Myller, M., Kawulok, M.: Training- and test-time data augmentation for hyperspectral image segmentation. IEEE Geosci. Remote Sens. Lett. 17(2), 292–296 (2020)

    Article  Google Scholar 

  17. Nettleton, D.F., Orriols-Puig, A., Fornells, A.: A study of the effect of different types of noise on the precision of supervised learning techniques. Artif. Intell. Rev. 33(4), 275–306 (2010)

    Article  Google Scholar 

  18. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  19. Powers, D.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation (2020)

    Google Scholar 

  20. Pradana, W.A., Adiwijaya, K., Wisesty, U.N.: Implementation of support vector machine for classification of speech marked Hijaiyah letters based on Mel frequency cepstrum coefficient feature extraction. J. Phys. Conf. Ser. 971(1), 012050 (2018)

    Google Scholar 

  21. Sáez, J.A., Galar, M., Luengo, J., Herrera, F.: Analyzing the presence of noise in multi-class problems: alleviating its influence with the One-vs-One decomposition. Knowl. Inf. Syst. 38(1), 179–206 (2012). https://doi.org/10.1007/s10115-012-0570-1

  22. Wijata, A.M., Nalepa, J.: Unbiased validation of the algorithms for automatic needle localization in ultrasound-guided breast biopsies. In: Proceedings of IEEE ICIP, pp. 3571–3575 (2022)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Agata M. Wijata or Jakub Nalepa .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 39 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Dubel, R., Wijata, A.M., Nalepa, J. (2023). On the Impact of Noisy Labels on Supervised Classification Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-36021-3_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36020-6

  • Online ISBN: 978-3-031-36021-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics