Abstract
Cross-modal face-voice association is an active research field that uses biometric features for cross-modal information retrieval. The primary approach to this task is to use contrastive learning to construct a modality-agnostic subspace. However, many existing contrastive learning methods in cross-modal research neglect the significance of symmetrical information within heterogeneous data. As a result, different negative examples are generated for each identity in a random mini-batch. Furthermore, the number of negative examples in contrastive learning is coupled with the mini-batch size and is therefore limited by GPU memory. To address these issues, this paper introduces a Cross-Modal Momentum Contrast (CMMC) algorithm, which leverages queues to provide sufficient and symmetric information. Moreover, we propose an update strategy to maintain the consistency of negative example information throughout the training process. By combining these operations, the proposed CMMC can effectively improve the correlation between face and voice data. Extensive experiments conducted on two datasets confirm the effectiveness of our framework and demonstrate its competitive performance compared to state-of-the-art methods.
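The abstract names three ingredients: queues of negatives that are decoupled from the mini-batch size, a symmetric face-to-voice and voice-to-face objective, and an update strategy that keeps queued negatives consistent. The sketch below illustrates these ideas in the style of momentum contrast (He et al., 2020). It is not the authors' implementation: the function names, the InfoNCE form of the loss, the momentum value 0.999, the temperature 0.07, and the queue size are all assumptions chosen for illustration.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA update (assumed MoCo-style): key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy over one positive logit and the queued negative logits."""
    logits = [dot(query, positive_key)] + [dot(query, n) for n in negative_keys]
    logits = [value / temperature for value in logits]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]


def symmetric_loss(face_q, voice_q, face_k, voice_k, face_queue, voice_queue):
    """Face query against voice keys plus voice query against face keys."""
    return (info_nce(face_q, voice_k, list(voice_queue))
            + info_nce(voice_q, face_k, list(face_queue)))


# Fixed-size FIFO queues: the number of negatives is set by QUEUE_SIZE,
# not by the mini-batch size (hypothetical value for illustration).
QUEUE_SIZE = 4
face_queue = deque(maxlen=QUEUE_SIZE)
voice_queue = deque(maxlen=QUEUE_SIZE)

# Toy embeddings standing in for encoder outputs of other identities.
for vec in ([0.0, 1.0], [-1.0, 0.0]):
    face_queue.append(l2_normalize(vec))
    voice_queue.append(l2_normalize(vec))

face_q = l2_normalize([1.0, 0.0])   # face query embedding
voice_q = l2_normalize([0.9, 0.1])  # voice query embedding, same identity
face_k, voice_k = face_q, voice_q   # in practice: outputs of momentum encoders

loss = symmetric_loss(face_q, voice_q, face_k, voice_k, face_queue, voice_queue)

# After the step, the current keys are enqueued; the oldest entries are
# dropped automatically once each queue is full.
face_queue.append(face_k)
voice_queue.append(voice_k)
```

In this sketch both modalities draw negatives from queues of the same size, which is one simple way to make the two directions of the loss symmetric. The slowly moving key parameters produced by `momentum_update` are what would keep older queue entries comparable with newer ones.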
Acknowledgements
This work was jointly supported by Natural Science Foundation of China under Grants 61972166 and 62372190, and Industry University Cooperation Project of Fujian Province under Grant 2021H603.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qiu, Y., Yu, Z., Gao, Z. (2024). An Efficient Momentum Framework for Face-Voice Association Learning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_22
DOI: https://doi.org/10.1007/978-981-99-8429-9_22
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9
eBook Packages: Computer Science, Computer Science (R0)