DOI: 10.1145/3488560.3498482
Research Article

Efficient Two-stage Label Noise Reduction for Retrieval-based Tasks

Published: 15 February 2022

Abstract

Noisy labels in datasets remain a persistent challenge in deep learning. Previous work detects noisy labels by training a model on the full dataset, analyzing the predicted probability distribution it produces, and estimating the probability that each label is noise. However, predictions from a model trained on the whole dataset are prone to overfitting, and overfitting to the noisy labels breaks the conditional independence between the predictions and the labels of clean and noisy items, making identification harder. Moreover, label noise reduction has been studied extensively for image datasets but far less for text datasets. This paper proposes a label noise reduction method for text datasets that applies to retrieval-based tasks: it obtains a conditionally independent probability distribution with which noisy labels can be identified accurately. The method first generates a candidate set of potentially noisy labels, predicts category probabilities with a model trained on the remaining, cleaner data, and then identifies noisy items by analyzing a confidence matrix. We further introduce a warm-up module and a sharpened cross-entropy loss function for efficient training in the first stage. Empirical results under varying rates of uniform and random label noise on five text datasets demonstrate that our method improves both label noise reduction accuracy and end-to-end classification accuracy. We also find that iterating the method is effective on datasets with high noise rates, and that the method causes little harm to clean datasets.
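To make the two-stage procedure concrete, below is a minimal NumPy sketch under stated assumptions: the abstract does not give exact formulations, so the temperature-based sharpening (in the spirit of MixMatch-style sharpening), the per-class self-confidence threshold over the confidence matrix (in the spirit of confident learning), and every function name and parameter here are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def sharpen(p, T=0.5):
    # Assumed form of "sharpening": raise probabilities to the power 1/T
    # and renormalize, pushing each distribution toward low entropy.
    p = np.power(p, 1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def sharpened_cross_entropy(probs, labels, T=0.5, eps=1e-12):
    # Assumed stage-1 warm-up loss: cross-entropy computed against the
    # sharpened predictions rather than the raw probabilities.
    q = sharpen(probs, T)
    return float(-np.log(q[np.arange(len(labels)), labels] + eps).mean())

def identify_noisy(probs, labels):
    # Assumed stage-2 rule. `probs` holds category probabilities for the
    # candidate set, predicted by a model trained only on the remaining
    # cleaner data, so the predictions stay conditionally independent of
    # the candidate labels. C[i, j] is the mean predicted probability of
    # class j over items whose given label is i; an item is flagged as
    # noisy when its confidence in its own label falls below its class's
    # mean self-confidence C[y, y].
    n, k = probs.shape
    C = np.zeros((k, k))
    for i in range(k):
        mask = labels == i
        if mask.any():
            C[i] = probs[mask].mean(axis=0)
    return probs[np.arange(n), labels] < C[labels, labels]

# Toy usage: three items, all labeled class 0; the second looks mislabeled.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 0, 0])
print(sharpened_cross_entropy(probs, labels))
print(identify_noisy(probs, labels))  # [False  True False]
```

Under this reading, iterating the method on high-noise datasets, as the abstract reports, would simply mean re-running both stages on the labels that survive each pass.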

Supplementary Material

MOV File (WSDM22-wsdmfp562.mov)
The presentation video of the paper "Efficient Two-stage Label Noise Reduction for Retrieval-based Tasks".


Cited By

  • (2025) SANet: Selective Aggregation Network for unsupervised object re-identification. Computer Vision and Image Understanding, 250:104232. https://doi.org/10.1016/j.cviu.2024.104232
  • (2022) Data Valuation Algorithm for Inertial Measurement Unit-Based Human Activity Recognition. Sensors, 23(1):184. https://doi.org/10.3390/s23010184
  • (2022) Deep Image Retrieval is not Robust to Label Noise. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4971-4976. https://doi.org/10.1109/CVPRW56347.2022.00545

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022, 1690 pages
ISBN: 9781450391320
DOI: 10.1145/3488560

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. data cleaning
2. noisy labels
3. retrieval-based tasks
4. text dataset

Conference

WSDM '22
Overall Acceptance Rate: 498 of 2,863 submissions, 17%
