An Efficient Momentum Framework for Face-Voice Association Learning

Qiu, Yuanyuan; Yu, Zhenning; Gao, Zhenguo

doi:10.1007/978-981-99-8429-9_22

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14425)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Cross-modal face-voice association is an active research field that uses biometric features for cross-modal information retrieval. The primary approach to this task uses contrastive learning to construct a modality-agnostic subspace. However, many existing contrastive learning methods in cross-modal research neglect the significance of symmetrical information within heterogeneous data, so different negative examples are generated for each identity in a random mini-batch. Furthermore, the number of negative examples in contrastive learning is coupled to the mini-batch size and is therefore limited by GPU memory. To address these issues, this paper introduces a Cross-Modal Momentum Contrast (CMMC) algorithm, which leverages queues to provide sufficient and symmetric information. Moreover, we propose an update strategy to maintain the consistency of negative example information throughout the training process. By combining these operations, the proposed CMMC can effectively improve the correlation between face and voice data. Extensive experiments conducted on two datasets confirm the superiority of our framework and demonstrate its competitive performance compared with state-of-the-art methods.
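
The abstract names three ingredients: queues of negatives decoupled from the mini-batch size, a symmetric treatment of the two modalities, and an update strategy that keeps queued negatives consistent. The preview does not give the actual CMMC formulation, so the following is only a minimal sketch of how these ideas are commonly combined in Momentum Contrast (MoCo) style training. All function names, the temperature of 0.07, the momentum of 0.999, and the toy two-dimensional embeddings are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: one positive key against queued negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    top = max(logits)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[0]


def symmetric_loss(face_q, voice_q, face_k, voice_k,
                   face_queue, voice_queue, temperature=0.07):
    """Average the face-to-voice and voice-to-face losses.

    Every query in one direction is scored against the same queue, so all
    identities share one set of negatives (an assumed reading of "symmetric").
    """
    f2v = info_nce(face_q, voice_k, list(voice_queue), temperature)
    v2f = info_nce(voice_q, face_k, list(face_queue), temperature)
    return 0.5 * (f2v + v2f)


def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder (MoCo-style)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Hypothetical usage: the queue capacity K is chosen independently of batch size.
K = 4
face_queue, voice_queue = deque(maxlen=K), deque(maxlen=K)
for i in range(6):  # the two oldest entries are evicted automatically
    face_queue.append(l2_normalize([1.0, float(i)]))
    voice_queue.append(l2_normalize([float(i), 1.0]))

face = l2_normalize([0.9, 0.1])
voice = l2_normalize([0.8, 0.2])
# For brevity the key embeddings reuse the query embeddings here.
loss = symmetric_loss(face, voice, face, voice, face_queue, voice_queue)
```

In this sketch the number of negatives is set by the queue capacity rather than by the mini-batch, and the slow momentum update is what keeps older queued keys comparable to newer ones. A real system would also need to exclude queued entries that share the query's identity, which this toy example does not do.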

Acknowledgements

This work was jointly supported by the Natural Science Foundation of China under Grants 61972166 and 62372190, and by the Industry University Cooperation Project of Fujian Province under Grant 2021H603.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Huaqiao University, Xiamen, 361021, Fujian, China
Yuanyuan Qiu, Zhenning Yu & Zhenguo Gao
Key Laboratory of Computer Vision and Machine Learning of Fujian Province University, Xiamen, 361021, Fujian, China
Yuanyuan Qiu & Zhenguo Gao

Corresponding author

Correspondence to Zhenguo Gao.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Qiu, Y., Yu, Z., Gao, Z. (2024). An Efficient Momentum Framework for Face-Voice Association Learning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_22

DOI: https://doi.org/10.1007/978-981-99-8429-9_22
Published: 24 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9
eBook Packages: Computer Science, Computer Science (R0)
