An Efficient Momentum Framework for Face-Voice Association Learning

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14425)

Abstract

Cross-modal face-voice association is an active research field that uses biometric features for cross-modal information retrieval. The primary approach to this task uses contrastive learning to construct a modality-agnostic subspace. However, many existing contrastive learning methods in cross-modal research neglect the symmetrical information within heterogeneous data. As a result, different negative examples are generated for each identity in a random mini-batch. Furthermore, the number of negative examples in contrastive learning is coupled with the mini-batch size and is therefore limited by GPU memory. To address these issues, this paper introduces a Cross-Modal Momentum Contrast (CMMC) algorithm, which uses queues to provide sufficient and symmetric information. We also propose an update strategy that keeps the negative example information consistent throughout training. Combining these operations, CMMC effectively improves the correlation between face and voice data. Extensive experiments on two datasets confirm the superiority of our framework and demonstrate competitive performance against state-of-the-art methods.
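The abstract names three ingredients: per-modality queues of negatives, a symmetric face-to-voice and voice-to-face objective, and a momentum-style update that keeps queued negatives consistent. The sketch below illustrates how these pieces commonly fit together in a generic Momentum Contrast (MoCo) style setup. It is not the authors' implementation: the function names, the InfoNCE loss, the temperature of 0.07, the momentum of 0.999, and the queue length are all assumptions chosen for illustration, and embeddings are plain Python lists rather than encoder outputs.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA update (assumed form): key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against a zero vector
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query, one positive key, and queued negatives."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negative_keys]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


def symmetric_loss(face_q, voice_q, face_k, voice_k,
                   face_queue, voice_queue, temperature=0.07):
    """Face->voice plus voice->face; each direction draws negatives
    from the other modality's queue, so both directions are treated alike."""
    f2v = info_nce(face_q, voice_k, list(voice_queue), temperature)
    v2f = info_nce(voice_q, face_k, list(face_queue), temperature)
    return f2v + v2f


def demo():
    # Hypothetical queue length; a deque with maxlen drops the oldest keys.
    face_queue = deque(maxlen=4)
    voice_queue = deque(maxlen=4)
    for i in range(6):
        face_queue.append([math.cos(i), math.sin(i)])
        voice_queue.append([math.sin(i), math.cos(i)])
    loss = symmetric_loss([1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [1.0, 0.05],
                          face_queue, voice_queue)
    # After the loss step, the current keys would be enqueued as negatives.
    face_queue.append([1.0, 0.0])
    voice_queue.append([1.0, 0.05])
    return loss, len(face_queue)


if __name__ == "__main__":
    print(demo())
```

Because the queues live outside the mini-batch, the number of negatives is set by the queue length rather than the batch size, which is the decoupling the abstract describes. The slow-moving key parameters are what keep older queued embeddings comparable to newer ones.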

Acknowledgements

This work was jointly supported by the Natural Science Foundation of China under Grants 61972166 and 62372190, and by the Industry University Cooperation Project of Fujian Province under Grant 2021H603.

Author information

Corresponding author

Correspondence to Zhenguo Gao.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Qiu, Y., Yu, Z., Gao, Z. (2024). An Efficient Momentum Framework for Face-Voice Association Learning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_22

  • DOI: https://doi.org/10.1007/978-981-99-8429-9_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8428-2

  • Online ISBN: 978-981-99-8429-9

  • eBook Packages: Computer Science, Computer Science (R0)
