DOI: 10.1145/3512527.3531364

Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning

Published: 27 June 2022

Abstract

Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most existing studies address this problem in a supervised manner and therefore cannot exploit the wealth of unlabeled video data. To address this limitation, we propose an unsupervised learning framework, Self-Lifting (SL), which learns from unlabeled video data. The framework alternates between two steps: "clustering" and "metric learning". In the first step, unlabeled videos are mapped into the feature space by a coarse model, and unsupervised clustering assigns a pseudo-label to each video. In the second step, these pseudo-labels serve as supervisory information to guide metric learning, which produces a refined model. The two steps are performed alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves state-of-the-art results among unsupervised methods and performs competitively with supervised competitors. Our code is released on GitHub.
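For readers who want a concrete picture of the alternating procedure described above, the sketch below shows one way the clustering/metric-learning loop could be wired up. It is not the released implementation: all names (VoiceFaceEncoder, assign_pseudo_labels, the feature dimensions) are hypothetical, and k-means plus a soft cross-modal contrastive loss are stand-ins for whatever clustering algorithm and metric-learning objective the paper actually uses.

```python
# A minimal, self-contained sketch of the alternating loop described in the
# abstract: (1) embed unlabeled videos and cluster them into pseudo-labels,
# (2) use those pseudo-labels as supervision for metric learning, repeat.
# Everything here is illustrative: the encoder, dimensions, k-means and the
# soft cross-modal loss are placeholders, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

N_VIDEOS, VOICE_DIM, FACE_DIM, EMB_DIM, N_CLUSTERS = 512, 192, 512, 128, 32


class VoiceFaceEncoder(nn.Module):
    """Two projection heads mapping each modality into a shared embedding space."""

    def __init__(self):
        super().__init__()
        self.voice = nn.Sequential(nn.Linear(VOICE_DIM, EMB_DIM), nn.ReLU(),
                                   nn.Linear(EMB_DIM, EMB_DIM))
        self.face = nn.Sequential(nn.Linear(FACE_DIM, EMB_DIM), nn.ReLU(),
                                  nn.Linear(EMB_DIM, EMB_DIM))

    def forward(self, v, f):
        return F.normalize(self.voice(v), dim=-1), F.normalize(self.face(f), dim=-1)


def assign_pseudo_labels(model, voices, faces, k=N_CLUSTERS):
    """Step 1: embed every video with the current (coarse) model, then cluster."""
    model.eval()
    with torch.no_grad():
        zv, zf = model(voices, faces)
        joint = torch.cat([zv, zf], dim=-1).numpy()
    return KMeans(n_clusters=k, n_init=10).fit_predict(joint)


def metric_learning_step(model, optimizer, voices, faces, labels):
    """Step 2: pull voice/face embeddings sharing a pseudo-label together."""
    model.train()
    zv, zf = model(voices, faces)
    logits = zv @ zf.t() / 0.07                       # cross-modal similarities
    same = torch.as_tensor(labels)[:, None] == torch.as_tensor(labels)[None, :]
    targets = same.float() / same.float().sum(dim=1, keepdim=True)
    loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Unlabeled "videos": random stand-ins for pre-extracted voice/face features.
voices = torch.randn(N_VIDEOS, VOICE_DIM)
faces = torch.randn(N_VIDEOS, FACE_DIM)

model = VoiceFaceEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for round_idx in range(3):                            # alternate the two steps
    pseudo_labels = assign_pseudo_labels(model, voices, faces)
    for _ in range(10):
        loss = metric_learning_step(model, optimizer, voices, faces, pseudo_labels)
    print(f"round {round_idx}: loss = {loss:.3f}")
```

Each outer round re-clusters with the refined model, so pseudo-labels and embeddings improve together, which is the "self-lifting" effect the abstract describes.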

Supplementary Material

MP4 File (ICMR22-icmrfp068.mp4)
Presentation video of "Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning". Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most existing studies address this problem in a supervised manner and therefore cannot exploit the wealth of unlabeled video data. Here, we analyze solutions to unsupervised VFAL and propose an unsupervised learning framework with three solid baselines. Experimental results show that our SL framework can effectively use unlabeled video data for learning: it exceeds other unsupervised competitors and narrows the performance gap with supervised approaches. Moreover, the framework can also serve as an effective pre-training method that improves existing methods.
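As an illustration of how embeddings from such a framework might be evaluated on the cross-modal matching and retrieval tasks listed in the author tags, the sketch below computes 1:2 matching accuracy and retrieval recall@1 on toy data. Both the protocol and every name in it (voice_emb, face_emb, the toy data) are assumptions for illustration, not details taken from the paper.

```python
# A minimal sketch of evaluating voice-face embeddings, assuming the common
# VFAL protocol of 1:2 matching (given a voice, pick the correct face out of
# two candidates) and voice-to-face retrieval (recall@1). The protocol and
# the toy, correlated embeddings below are illustrative assumptions only;
# row i of voice_emb and face_emb is assumed to belong to the same identity.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 128
voice_emb = rng.normal(size=(n, d))
face_emb = voice_emb + 0.5 * rng.normal(size=(n, d))   # toy correlated pairs
voice_emb /= np.linalg.norm(voice_emb, axis=1, keepdims=True)
face_emb /= np.linalg.norm(face_emb, axis=1, keepdims=True)

sim = voice_emb @ face_emb.T                            # cosine similarities

# 1:2 matching: correct face vs. one random impostor face per voice.
impostors = (np.arange(n) + rng.integers(1, n, size=n)) % n
matching_acc = np.mean(sim[np.arange(n), np.arange(n)] >
                       sim[np.arange(n), impostors])

# Voice-to-face retrieval: recall@1 over the whole face gallery.
recall_at_1 = np.mean(sim.argmax(axis=1) == np.arange(n))

print(f"1:2 matching accuracy: {matching_acc:.3f}")
print(f"retrieval recall@1:    {recall_at_1:.3f}")
```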



Information

Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN: 9781450392389
DOI: 10.1145/3512527

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2022

Author Tags

  1. cross-modal matching
  2. cross-modal retrieval
  3. unsupervised learning
  4. voice-face association

Qualifiers

  • Research-article

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 10
Reflects downloads up to 13 Feb 2025

Cited By
  • (2024) Convex Feature Embedding for Face and Voice Association. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2342-2346. https://doi.org/10.1145/3626772.3657975 (10 Jul 2024)
  • (2024) Public-Private Attributes-Based Variational Adversarial Network for Audio-Visual Cross-Modal Matching. IEEE Transactions on Circuits and Systems for Video Technology, 34(9), 8698-8709. https://doi.org/10.1109/TCSVT.2024.3390573 (Sep 2024)
  • (2023) EMP: Emotion-guided Multi-modal Fusion and Contrastive Learning for Personality Traits Recognition. Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, 243-252. https://doi.org/10.1145/3591106.3592243 (12 Jun 2023)
  • (2023) Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association. Proceedings of the 31st ACM International Conference on Multimedia, 7056-7064. https://doi.org/10.1145/3581783.3611938 (26 Oct 2023)
  • (2023) EFT: Expert Fusion Transformer for Voice-Face Association Learning. 2023 IEEE International Conference on Multimedia and Expo (ICME), 2603-2608. https://doi.org/10.1109/ICME55011.2023.00443 (Jul 2023)
  • (2023) Local-Global Contrast for Learning Voice-Face Representations. 2023 IEEE International Conference on Image Processing (ICIP), 51-55. https://doi.org/10.1109/ICIP49359.2023.10222130 (8 Oct 2023)
