CIKM '23 Conference Proceedings · Short paper
DOI: 10.1145/3583780.3615146

Retrieval-Based Unsupervised Noisy Label Detection on Text Data

Published: 21 October 2023

ABSTRACT

The success of deep neural networks hinges on both high-quality annotations and copious amounts of data; however, in practice, a compromise between dataset size and quality frequently arises. Data collection and cleansing are often resource-intensive and time-consuming, leading to real-world datasets containing label noise that can introduce incorrect correlation patterns, adversely affecting model generalization capabilities. The efficient identification of corrupted patterns is indispensable, with prevalent methods predominantly concentrating on devising robust training techniques to preclude models from internalizing these patterns. Nevertheless, these supervised approaches often necessitate tailored training procedures, potentially resulting in overfitting corrupted patterns and a decline in detection performance. This paper presents a retrieval-based unsupervised solution for the detection of noisy labels, surpassing the performance of three current competitive methods in this domain.
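To make the general idea concrete, the sketch below shows one plausible, deliberately simplified retrieval-based, training-free detector: embed every example, retrieve its nearest neighbours, and flag examples whose label disagrees with most of their neighbours' labels. This is an illustrative assumption about what such a detector can look like, not the method proposed in the paper; the flag_suspect_labels helper, the TF-IDF encoder, and the agreement threshold are choices made here purely to keep the example self-contained and runnable.

    # A minimal, hypothetical sketch of a retrieval-based noisy-label detector.
    # It is NOT the paper's method: it only illustrates the general recipe of
    # retrieving nearest neighbours in an embedding space and flagging examples
    # whose label disagrees with the labels of their retrieved neighbours.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    def flag_suspect_labels(texts, labels, k=10, agreement_threshold=0.5):
        """Return indices of examples whose label is supported by fewer than
        `agreement_threshold` of their k retrieved neighbours."""
        labels = np.asarray(labels)
        # Any text encoder could be used; TF-IDF keeps the sketch self-contained.
        embeddings = TfidfVectorizer().fit_transform(texts)
        # Retrieve k + 1 neighbours so the trivial self-match can be dropped.
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
        _, neighbour_idx = nn.kneighbors(embeddings)
        neighbour_labels = labels[neighbour_idx[:, 1:]]   # drop self-match
        agreement = (neighbour_labels == labels[:, None]).mean(axis=1)
        return np.where(agreement < agreement_threshold)[0]

    # Toy usage: the last example is deliberately given a dubious label.
    texts = ["great movie", "loved this film", "wonderful acting",
             "terrible plot", "awful pacing", "boring and bad",
             "wonderful movie, loved the film"]
    labels = [1, 1, 1, 0, 0, 0, 0]
    print(flag_suspect_labels(texts, labels, k=3))  # indices of suspect labels

The design choice this sketch highlights is that no model is trained on the noisy labels: the only supervision-free signal is agreement between an example's label and the labels of its retrieved neighbours, and both the encoder and the threshold can be swapped for stronger alternatives.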



Published in

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023, 5508 pages
ISBN: 979-8-4007-0124-5
DOI: 10.1145/3583780

Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• Short paper

Acceptance Rates

Overall acceptance rate: 1,861 of 8,427 submissions, 22%

