Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning

Published: 27 June 2022


Voice-face association learning (VFAL) aims to uncover the latent connections between voices and faces. Most current studies address this problem in a supervised manner, which cannot exploit the wealth of unlabeled video data. To address this limitation, we propose an unsupervised learning framework, Self-Lifting (SL), which learns from unlabeled video data. The framework alternates between two steps: "clustering" and "metric learning". In the first step, a coarse model maps unlabeled video data into the feature space, and unsupervised clustering assigns a pseudo-label to each video. In the second step, the pseudo-labels serve as supervisory information to guide metric learning, which produces a refined model. The two steps are repeated alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves state-of-the-art results among unsupervised methods and performs competitively with supervised methods. Our code is released on GitHub.
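
The alternation between clustering and metric learning described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the concatenation of voice and face features before clustering, the plain k-means with farthest-point initialization, and the `refine` step (which simply pulls each embedding toward the centroid of its pseudo-class as a stand-in for training a network with a metric-learning loss) are all assumptions made for illustration, as are the function names and the toy data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain k-means (assumed clustering method); returns one pseudo-label per point."""
    # Deterministic farthest-point initialization, starting from the first point.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def refine(voice, face, labels, step=0.5):
    """Hypothetical stand-in for metric learning: pull every voice and face
    embedding toward the shared centroid of its pseudo-class."""
    new_voice, new_face = [], []
    for v, f, l in zip(voice, face, labels):
        members = [e for vv, ff, ll in zip(voice, face, labels) if ll == l
                   for e in (vv, ff)]
        proto = [sum(col) / len(members) for col in zip(*members)]
        new_voice.append([x + step * (p - x) for x, p in zip(v, proto)])
        new_face.append([x + step * (p - x) for x, p in zip(f, proto)])
    return new_voice, new_face


def self_lifting(voice, face, k, rounds=3):
    """Alternate pseudo-labeling (clustering) and refinement for a few rounds."""
    for _ in range(rounds):
        # Step 1: cluster videos in the joint voice+face feature space (assumption).
        joint = [v + f for v, f in zip(voice, face)]
        labels = kmeans(joint, k)
        # Step 2: use the pseudo-labels as supervision for the refinement step.
        voice, face = refine(voice, face, labels)
    return labels, voice, face


# Toy data: four "videos", two per identity, with 2-D voice and face embeddings.
toy_voice = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
toy_face = [[1.0, 0.0], [0.9, 0.2], [6.0, 5.0], [5.8, 5.1]]
toy_labels, lifted_voice, lifted_face = self_lifting(toy_voice, toy_face, k=2)
print(toy_labels)
```

In this toy setting the two videos of each identity end up sharing a pseudo-label, and each video's voice and face embeddings move closer together with every round. A real system would replace `refine` with gradient-based training of voice and face encoders.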

MP4 File (ICMR22-icmrfp068.mp4)
Presentation video of "Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning". In this work, we analyze how to solve unsupervised VFAL and propose an unsupervised learning framework with three solid baselines. Experimental results show that our SL framework can effectively use unlabeled video data for learning. It outperforms other unsupervised competitors and narrows the performance gap with supervised approaches. Moreover, the framework can also serve as an effective pre-training method to improve existing methods.





ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages


Author Tags

  1. cross-modal matching
  2. cross-modal retrieval
  3. unsupervised learning
  4. voice-face association


Cited By

  • (2024) Convex Feature Embedding for Face and Voice Association. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2342-2346. DOI: 10.1145/3626772.3657975. Online publication date: 10-Jul-2024
  • (2024) Public-Private Attributes-Based Variational Adversarial Network for Audio-Visual Cross-Modal Matching. IEEE Transactions on Circuits and Systems for Video Technology 34:9, pp. 8698-8709. DOI: 10.1109/TCSVT.2024.3390573. Online publication date: Sep-2024
  • (2023) EMP: Emotion-guided Multi-modal Fusion and Contrastive Learning for Personality Traits Recognition. Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, pp. 243-252. DOI: 10.1145/3591106.3592243. Online publication date: 12-Jun-2023
  • (2023) Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association. Proceedings of the 31st ACM International Conference on Multimedia, pp. 7056-7064. DOI: 10.1145/3581783.3611938. Online publication date: 26-Oct-2023
  • (2023) EFT: Expert Fusion Transformer for Voice-Face Association Learning. 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 2603-2608. DOI: 10.1109/ICME55011.2023.00443. Online publication date: Jul-2023
  • (2023) Local-Global Contrast for Learning Voice-Face Representations. 2023 IEEE International Conference on Image Processing (ICIP), pp. 51-55. DOI: 10.1109/ICIP49359.2023.10222130. Online publication date: 8-Oct-2023

