Abstract
With the development of representation learning techniques, Dense Retrieval (DR) has become a new paradigm for retrieving relevant texts with better ranking performance. Although current DR models have achieved encouraging results, their performance is highly sensitive to the level of noise in the training samples. In particular, many examples that are not labeled as positives, and are therefore used as negatives by default, turn out to be positive or highly relevant. It is thus critical to account for this inevitable noise when training DR models, yet little work on dense retrieval has taken it into consideration. In this work, we systematically investigate the serious negative impact of noisy training samples and propose a new denoising approach, DADR (a Denoising Approach based on Dynamic weights for Dense Retrieval model training), which mitigates the effect of noise on model performance by assigning different weights to individual samples during training. We combine the proposed DADR approach with three representative sampling methods and several loss functions. Experimental results on two publicly available retrieval benchmark datasets show that our approach significantly improves DR model performance over standard training.
This work was supported by the Hunan Provincial Natural Science Foundation (Nos. 2022JJ30668 and 2022JJ30046).
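As a rough illustration of the idea described in the abstract, the PyTorch sketch below shows one plausible way to assign dynamic per-sample weights in an in-batch-negative contrastive loss for DR training. The weighting rule used here (down-weighting "negatives" whose scores approach the positive's, on the assumption that they are likely unlabeled positives) is a hypothetical instantiation for illustration, not the authors' published formula, and the function name and signature are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_in_batch_loss(q_emb: torch.Tensor,
                           p_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """q_emb, p_emb: (B, d) query / positive-passage embeddings.
    Passage j != i serves as an in-batch negative for query i."""
    sim = q_emb @ p_emb.T / temperature              # (B, B) scaled similarities
    labels = torch.arange(sim.size(0), device=sim.device)

    # Hypothetical dynamic weighting: a "negative" whose score approaches
    # the positive's is likely an unlabeled positive (noise), so its weight
    # is pushed toward 0; easy true negatives keep a weight near 1.
    with torch.no_grad():
        margin = sim.diag().unsqueeze(1) - sim       # (B, B); small => suspicious
        weights = torch.sigmoid(margin)
        weights.fill_diagonal_(1.0)                  # positives keep full weight

    # Weighting a negative by w is equivalent to adding log(w) to its logit
    # inside the softmax denominator.
    return F.cross_entropy(sim + weights.clamp_min(1e-12).log(), labels)
```

Because the positive's weight is fixed at 1, the loss reduces to the standard in-batch softmax cross-entropy when all negative weights are 1, so the weighting acts purely as a soft down-weighting of suspected false negatives.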
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Du, M. et al. (2024). DADR: A Denoising Approach for Dense Retrieval Model Training. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_11
DOI: https://doi.org/10.1007/978-981-97-2387-4_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2386-7
Online ISBN: 978-981-97-2387-4
eBook Packages: Computer Science, Computer Science (R0)