DOI: 10.1145/3512527.3531364

Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning

Published: 27 June 2022

Abstract

Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most existing studies address this problem in a supervised manner and therefore cannot exploit the wealth of unlabeled video data. To address this limitation, we propose an unsupervised learning framework, Self-Lifting (SL), which learns from unlabeled video data. The framework alternates between two steps: "clustering" and "metric learning". In the first step, unlabeled videos are mapped into the feature space by a coarse model, and unsupervised clustering assigns a pseudo-label to each video. In the second step, these pseudo-labels serve as supervisory information to guide metric learning, which produces a refined model. The two steps are performed alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves state-of-the-art results among unsupervised methods and performs competitively with supervised competitors. Our code is released on GitHub.
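For readers who want a concrete picture of the alternating procedure described above, the sketch below shows one way the clustering/metric-learning loop could be wired up. It is not the released implementation: all names (VoiceFaceEncoder, assign_pseudo_labels, the feature dimensions) are hypothetical, and k-means plus a soft cross-modal contrastive loss are stand-ins for whatever clustering algorithm and metric-learning objective the paper actually uses.

```python
# A minimal, self-contained sketch of the alternating loop described in the
# abstract: (1) embed unlabeled videos and cluster them into pseudo-labels,
# (2) use those pseudo-labels as supervision for metric learning, repeat.
# Everything here is illustrative: the encoder, dimensions, k-means and the
# soft cross-modal loss are placeholders, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

N_VIDEOS, VOICE_DIM, FACE_DIM, EMB_DIM, N_CLUSTERS = 512, 192, 512, 128, 32


class VoiceFaceEncoder(nn.Module):
    """Two projection heads mapping each modality into a shared embedding space."""

    def __init__(self):
        super().__init__()
        self.voice = nn.Sequential(nn.Linear(VOICE_DIM, EMB_DIM), nn.ReLU(),
                                   nn.Linear(EMB_DIM, EMB_DIM))
        self.face = nn.Sequential(nn.Linear(FACE_DIM, EMB_DIM), nn.ReLU(),
                                  nn.Linear(EMB_DIM, EMB_DIM))

    def forward(self, v, f):
        return F.normalize(self.voice(v), dim=-1), F.normalize(self.face(f), dim=-1)


def assign_pseudo_labels(model, voices, faces, k=N_CLUSTERS):
    """Step 1: embed every video with the current (coarse) model, then cluster."""
    model.eval()
    with torch.no_grad():
        zv, zf = model(voices, faces)
        joint = torch.cat([zv, zf], dim=-1).numpy()
    return KMeans(n_clusters=k, n_init=10).fit_predict(joint)


def metric_learning_step(model, optimizer, voices, faces, labels):
    """Step 2: pull voice/face embeddings sharing a pseudo-label together."""
    model.train()
    zv, zf = model(voices, faces)
    logits = zv @ zf.t() / 0.07                       # cross-modal similarities
    same = torch.as_tensor(labels)[:, None] == torch.as_tensor(labels)[None, :]
    targets = same.float() / same.float().sum(dim=1, keepdim=True)
    loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Unlabeled "videos": random stand-ins for pre-extracted voice/face features.
voices = torch.randn(N_VIDEOS, VOICE_DIM)
faces = torch.randn(N_VIDEOS, FACE_DIM)

model = VoiceFaceEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for round_idx in range(3):                            # alternate the two steps
    pseudo_labels = assign_pseudo_labels(model, voices, faces)
    for _ in range(10):
        loss = metric_learning_step(model, optimizer, voices, faces, pseudo_labels)
    print(f"round {round_idx}: loss = {loss:.3f}")
```

Each outer round re-clusters with the refined model, so pseudo-labels and embeddings improve together, which is the "self-lifting" effect the abstract describes.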

Supplementary Material

MP4 File (ICMR22-icmrfp068.mp4)
Presentation video of "Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning". Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most existing studies address this problem in a supervised manner and therefore cannot exploit the wealth of unlabeled video data. Here, we analyze solutions to unsupervised VFAL and propose an unsupervised learning framework with three solid baselines. Experimental results show that our SL framework can effectively use unlabeled video data for learning: it exceeds other unsupervised competitors and narrows the performance gap with supervised approaches. Moreover, the framework can also serve as an effective pre-training method that improves existing methods.
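As an illustration of how embeddings from such a framework might be evaluated on the cross-modal matching and retrieval tasks listed in the author tags, the sketch below computes 1:2 matching accuracy and retrieval recall@1 on toy data. Both the protocol and every name in it (voice_emb, face_emb, the toy data) are assumptions for illustration, not details taken from the paper.

```python
# A minimal sketch of evaluating voice-face embeddings, assuming the common
# VFAL protocol of 1:2 matching (given a voice, pick the correct face out of
# two candidates) and voice-to-face retrieval (recall@1). The protocol and
# the toy, correlated embeddings below are illustrative assumptions only;
# row i of voice_emb and face_emb is assumed to belong to the same identity.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 128
voice_emb = rng.normal(size=(n, d))
face_emb = voice_emb + 0.5 * rng.normal(size=(n, d))   # toy correlated pairs
voice_emb /= np.linalg.norm(voice_emb, axis=1, keepdims=True)
face_emb /= np.linalg.norm(face_emb, axis=1, keepdims=True)

sim = voice_emb @ face_emb.T                            # cosine similarities

# 1:2 matching: correct face vs. one random impostor face per voice.
impostors = (np.arange(n) + rng.integers(1, n, size=n)) % n
matching_acc = np.mean(sim[np.arange(n), np.arange(n)] >
                       sim[np.arange(n), impostors])

# Voice-to-face retrieval: recall@1 over the whole face gallery.
recall_at_1 = np.mean(sim.argmax(axis=1) == np.arange(n))

print(f"1:2 matching accuracy: {matching_acc:.3f}")
print(f"retrieval recall@1:    {recall_at_1:.3f}")
```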



Information

Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN: 9781450392389
DOI: 10.1145/3512527

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2022

Author Tags

  1. cross-modal matching
  2. cross-modal retrieval
  3. unsupervised learning
  4. voice-face association

Qualifiers

  • Research-article

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 10
Reflects downloads up to 13 Feb 2025

Cited By
  • (2024) Convex Feature Embedding for Face and Voice Association. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2342-2346. https://doi.org/10.1145/3626772.3657975 (10 Jul 2024)
  • (2024) Public-Private Attributes-Based Variational Adversarial Network for Audio-Visual Cross-Modal Matching. IEEE Transactions on Circuits and Systems for Video Technology, 34(9), 8698-8709. https://doi.org/10.1109/TCSVT.2024.3390573 (Sep 2024)
  • (2023) EMP: Emotion-guided Multi-modal Fusion and Contrastive Learning for Personality Traits Recognition. Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, 243-252. https://doi.org/10.1145/3591106.3592243 (12 Jun 2023)
  • (2023) Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association. Proceedings of the 31st ACM International Conference on Multimedia, 7056-7064. https://doi.org/10.1145/3581783.3611938 (26 Oct 2023)
  • (2023) EFT: Expert Fusion Transformer for Voice-Face Association Learning. 2023 IEEE International Conference on Multimedia and Expo (ICME), 2603-2608. https://doi.org/10.1109/ICME55011.2023.00443 (Jul 2023)
  • (2023) Local-Global Contrast for Learning Voice-Face Representations. 2023 IEEE International Conference on Image Processing (ICIP), 51-55. https://doi.org/10.1109/ICIP49359.2023.10222130 (8 Oct 2023)
