research-article

Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association

Published: 27 October 2023

Abstract

Voice-face association is generally formulated as a cross-modal cognitive matching problem, and recent attention has been paid to the feasibility of devising computational mechanisms for recognizing such associations. Existing works commonly resort to a combination of contrastive learning and classification-based losses to correlate the heterogeneous data. Nevertheless, the typical per-category features, known as archetypes, that this combination relies on suffer from the weak invariance of modality-specific features within the same identity, which may induce a cross-modal joint feature space with calibration deviations. To tackle these problems, this paper presents an efficient archetype-agnostic framework for reliable voice-face association. First, an Archetype-agnostic Subspace Merging (AaSM) method is carefully designed to perform feature calibration, removing the archetype dependence and facilitating the mutual perception of the data. Further, an efficient Bilateral Connection Re-gauging scheme is proposed to quantitatively screen and calibrate biased data, namely loose pairs that deviate from the joint feature space. Besides, an Instance Equilibrium strategy is dynamically derived to optimize training on loose data pairs and significantly improve data utilization. Through the joint exploitation of the above, the proposed framework associates voice-face data well and benefits various kinds of cross-modal cognitive tasks. Extensive experiments verify the superiority of the proposed framework and show performance competitive with the state of the art.
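The contrastive component the abstract refers to can be illustrated with a generic symmetric InfoNCE loss over paired voice and face embeddings. This is only a minimal sketch of that standard baseline, not the paper's AaSM, re-gauging, or Instance Equilibrium mechanisms; the function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_infonce(voice_emb, face_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired voice/face embeddings.

    Row i of each matrix is assumed to come from the same identity, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    v = l2_normalize(voice_emb)
    f = l2_normalize(face_emb)
    logits = v @ f.T / temperature          # (B, B) cosine similarities, sharpened
    labels = np.arange(len(logits))         # positives sit on the diagonal

    def ce(lg):
        # Row-wise softmax cross-entropy against the diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the two retrieval directions: voice->face and face->voice.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Under this kind of objective, correctly paired batches yield a much lower loss than mismatched ones; the archetype dependence criticized in the abstract arises when such a loss is combined with classification heads built on per-identity prototype features.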


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. archetype-agnostic
    2. instance equilibrium
    3. re-gauging
    4. voice-face association

    Qualifiers

    • Research-article

    Funding Sources

    • Zhejiang Lab
    • NSFC/Research Grants Council (RGC) Joint Research Scheme

    Conference

    MM '23
    Sponsor:
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
