DOI: 10.1145/3488560.3498482
Research Article

Efficient Two-stage Label Noise Reduction for Retrieval-based Tasks

Published: 15 February 2022

Abstract

Noisy labels in datasets remain a persistent challenge in deep learning. Previous work detects noisy labels by training a model on the full dataset, analyzing the predicted probability distribution it produces, and estimating the probability that each label is noise. However, predictions from a model trained on the whole dataset are prone to overfitting, and overfitting to the noisy labels breaks the conditional independence between the predictions and the labels of clean and noisy items, making identification harder. Moreover, label noise reduction has been studied extensively for image datasets but far less for text datasets. This paper proposes a label noise reduction method for text datasets that applies to retrieval-based tasks: it obtains a conditionally independent probability distribution with which noisy labels can be identified accurately. The method first generates a candidate set of potentially noisy labels, predicts category probabilities with a model trained on the remaining, cleaner data, and then identifies noisy items by analyzing a confidence matrix. We further introduce a warm-up module and a sharpened cross-entropy loss function for efficient training in the first stage. Empirical results under varying rates of uniform and random label noise on five text datasets demonstrate that our method improves both label noise reduction accuracy and end-to-end classification accuracy. We also find that iterating the method is effective on datasets with high noise rates, and that the method causes little harm to clean datasets.
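To make the two-stage procedure concrete, below is a minimal NumPy sketch under stated assumptions: the abstract does not give exact formulations, so the temperature-based sharpening (in the spirit of MixMatch-style sharpening), the per-class self-confidence threshold over the confidence matrix (in the spirit of confident learning), and every function name and parameter here are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def sharpen(p, T=0.5):
    # Assumed form of "sharpening": raise probabilities to the power 1/T
    # and renormalize, pushing each distribution toward low entropy.
    p = np.power(p, 1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def sharpened_cross_entropy(probs, labels, T=0.5, eps=1e-12):
    # Assumed stage-1 warm-up loss: cross-entropy computed against the
    # sharpened predictions rather than the raw probabilities.
    q = sharpen(probs, T)
    return float(-np.log(q[np.arange(len(labels)), labels] + eps).mean())

def identify_noisy(probs, labels):
    # Assumed stage-2 rule. `probs` holds category probabilities for the
    # candidate set, predicted by a model trained only on the remaining
    # cleaner data, so the predictions stay conditionally independent of
    # the candidate labels. C[i, j] is the mean predicted probability of
    # class j over items whose given label is i; an item is flagged as
    # noisy when its confidence in its own label falls below its class's
    # mean self-confidence C[y, y].
    n, k = probs.shape
    C = np.zeros((k, k))
    for i in range(k):
        mask = labels == i
        if mask.any():
            C[i] = probs[mask].mean(axis=0)
    return probs[np.arange(n), labels] < C[labels, labels]

# Toy usage: three items, all labeled class 0; the second looks mislabeled.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 0, 0])
print(sharpened_cross_entropy(probs, labels))
print(identify_noisy(probs, labels))  # [False  True False]
```

Under this reading, iterating the method on high-noise datasets, as the abstract reports, would simply mean re-running both stages on the labels that survive each pass.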

Supplementary Material

MOV File (WSDM22-wsdmfp562.mov)
The presentation video of the paper "Efficient Two-stage Label Noise Reduction for Retrieval-based Tasks".


Cited By

  • (2025) SANet: Selective Aggregation Network for unsupervised object re-identification. Computer Vision and Image Understanding, 250:104232. https://doi.org/10.1016/j.cviu.2024.104232
  • (2022) Data Valuation Algorithm for Inertial Measurement Unit-Based Human Activity Recognition. Sensors, 23(1):184. https://doi.org/10.3390/s23010184
  • (2022) Deep Image Retrieval is not Robust to Label Noise. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4971-4976. https://doi.org/10.1109/CVPRW56347.2022.00545

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022, 1690 pages
ISBN: 9781450391320
DOI: 10.1145/3488560

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. data cleaning
2. noisy labels
3. retrieval-based tasks
4. text dataset

Conference

WSDM '22
Overall Acceptance Rate: 498 of 2,863 submissions, 17%
