research-article

Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association

Published: 27 October 2023

Abstract

Voice-face association is generally formulated as a cross-modal cognitive matching problem, and recent attention has been paid to the feasibility of devising computational mechanisms for recognizing such associations. Existing works commonly resort to a combination of contrastive learning and classification-based losses to correlate the heterogeneous data. Nevertheless, the typical per-category features, known as archetypes, that this combination relies on suffer from the weak invariance of modality-specific features within the same identity, which may induce a cross-modal joint feature space with calibration deviations. To tackle these problems, this paper presents an efficient archetype-agnostic framework for reliable voice-face association. First, an Archetype-agnostic Subspace Merging (AaSM) method is carefully designed to perform feature calibration, removing the archetype dependence and facilitating the mutual perception of the data. Further, an efficient Bilateral Connection Re-gauging scheme is proposed to quantitatively screen and calibrate biased data, namely loose pairs that deviate from the joint feature space. Besides, an Instance Equilibrium strategy is dynamically derived to optimize training on loose data pairs and significantly improve data utilization. Through the joint exploitation of the above, the proposed framework associates voice-face data well and benefits various kinds of cross-modal cognitive tasks. Extensive experiments verify the superiority of the proposed framework and show performance competitive with the state of the art.
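The contrastive component the abstract refers to can be illustrated with a generic symmetric InfoNCE loss over paired voice and face embeddings. This is only a minimal sketch of that standard baseline, not the paper's AaSM, re-gauging, or Instance Equilibrium mechanisms; the function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_infonce(voice_emb, face_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired voice/face embeddings.

    Row i of each matrix is assumed to come from the same identity, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    v = l2_normalize(voice_emb)
    f = l2_normalize(face_emb)
    logits = v @ f.T / temperature          # (B, B) cosine similarities, sharpened
    labels = np.arange(len(logits))         # positives sit on the diagonal

    def ce(lg):
        # Row-wise softmax cross-entropy against the diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the two retrieval directions: voice->face and face->voice.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Under this kind of objective, correctly paired batches yield a much lower loss than mismatched ones; the archetype dependence criticized in the abstract arises when such a loss is combined with classification heads built on per-identity prototype features.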


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. archetype-agnostic
    2. instance equilibrium
    3. re-gauging
    4. voice-face association

    Qualifiers

    • Research-article

    Funding Sources

    • Zhejiang Lab
    • NSFC/Research Grants Council (RGC) Joint Research Scheme

    Conference

    MM '23
    Sponsor:
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
