
Learning coordinated emotion representation between voice and face


Abstract

Voice and face are two of the most important perceptual modalities for humans. In recent years, many researchers have shown great interest in learning cross-modal representations for various face-voice association tasks. However, existing methods focus on biometric attributes and rarely take the emotion semantics shared between voice and face into account. In this paper, we present a novel two-stream model, called the Emotion Representation Learning Network (EmoRL-Net), which learns coordinated cross-modal emotion representations for downstream matching and retrieval tasks. Within the proposed approach, we first design two sub-network architectures that learn unimodal features from the two modalities. We then train EmoRL-Net with an objective function that combines one explicit and two implicit constraints, and employ an online semi-hard negative mining strategy to construct triplet units within each mini-batch, thereby stabilizing and speeding up training. Extensive experiments demonstrate that the proposed method benefits various face-voice emotion tasks, including cross-modal verification, 1:2 matching, 1:N matching, and retrieval. The results also show that the proposed method outperforms state-of-the-art approaches.
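As an illustration of the online semi-hard negative mining strategy mentioned in the abstract, the sketch below mines, for each paired voice/face embedding in a mini-batch, a different-emotion negative whose distance to the anchor lies between the positive distance and the positive distance plus the margin. This is a minimal PyTorch sketch under assumed details: the function name `semi_hard_triplet_loss`, the margin value, the embedding size, and the single triplet term are illustrative choices, not the authors' full EmoRL-Net objective, which additionally combines one explicit and two implicit constraints.

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(voice_emb, face_emb, labels, margin=0.2):
    """Illustrative cross-modal triplet loss with online semi-hard negative mining.

    voice_emb, face_emb: (B, D) L2-normalized embeddings of paired clips.
    labels: (B,) emotion labels shared by each voice/face pair.
    """
    # Pairwise distances between voice anchors and all face candidates in the batch.
    dist = torch.cdist(voice_emb, face_emb)                  # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # (B, B) same-emotion mask

    losses = []
    for i in range(voice_emb.size(0)):
        d_ap = dist[i, i]                                    # the paired face is the positive
        neg_d = dist[i][~same[i]]                            # faces with a different emotion
        if neg_d.numel() == 0:                               # degenerate batch: no negatives
            continue
        # Semi-hard negatives: farther than the positive but still inside the margin.
        semi_hard = neg_d[(neg_d > d_ap) & (neg_d < d_ap + margin)]
        d_an = semi_hard.min() if semi_hard.numel() > 0 else neg_d.min()
        losses.append(F.relu(d_ap - d_an + margin))
    return torch.stack(losses).mean() if losses else voice_emb.new_zeros(())

# Usage with random stand-ins for the outputs of the two sub-networks.
voice = F.normalize(torch.randn(32, 128), dim=1)
face = F.normalize(torch.randn(32, 128), dim=1)
emotions = torch.randint(0, 6, (32,))
print(semi_hard_triplet_loss(voice, face, emotions).item())
```

Mining within the mini-batch avoids precomputing triplets offline; selecting negatives inside the margin band (rather than the hardest ones overall) is the standard way to keep gradients informative without collapsing training.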




Acknowledgements

This work was sponsored by the Ningbo Science Technology Plan projects (Grants 2020Z082, 2021S091), and the K.C. Wong Magna Fund in Ningbo University.

Author information


Corresponding author

Correspondence to Zhen Liu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Fang, Z., Liu, Z., Hung, CC. et al. Learning coordinated emotion representation between voice and face. Appl Intell 53, 14470–14492 (2023). https://doi.org/10.1007/s10489-022-04216-6

