Exploiting Multi-modal Fusion for Robust Face Representation Learning with Missing Modality

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Abstract

Current RGB-D-T face recognition methods can alleviate sensitivity to facial variations, pose, occlusion, and illumination by incorporating complementary information, but they rely heavily on the availability of complete modalities. Given the likelihood of missing modalities in real-world scenarios, and the fact that current multi-modal recognition models perform poorly when faced with incomplete data, robust multi-modal models for face recognition that can handle missing modalities are highly desirable. To this end, we propose a multi-modal fusion framework for robustly learning face representations in the presence of missing modalities, using a combination of RGB, depth, and thermal inputs. Our approach effectively blends these modalities while also bridging the semantic gap among them. Specifically, we put forward a novel modality-missing loss function to learn modality-specific features that are robust to missing-modality conditions. To project the features of the different modalities into the same semantic space, we learn a joint modality-invariant representation under a training strategy with a central moment discrepancy (CMD) based distance constraint. Extensive experiments on several benchmark datasets, such as VAP RGBD-T and Lock3DFace, demonstrate the effectiveness and robustness of the proposed approach under uncertain missing-modality conditions compared with all baseline algorithms.
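
The CMD constraint mentioned above has a concrete published form (Zellinger et al., 2017): it penalises differences between the empirical means and the higher-order central moments of two feature distributions. The paper page includes no code, so the following is only a minimal PyTorch sketch of that distance, not the authors' implementation; the function name and hyper-parameters (moment order k_max, activation bounds [a, b]) are illustrative.

```python
import torch

def central_moment_discrepancy(x: torch.Tensor,
                               y: torch.Tensor,
                               k_max: int = 5,
                               a: float = 0.0,
                               b: float = 1.0) -> torch.Tensor:
    """CMD between two feature batches of shape (batch, dim), after
    Zellinger et al. (2017). Activations are assumed to lie in
    [a, b] (e.g. after a sigmoid), which normalises each term."""
    span = b - a
    mean_x, mean_y = x.mean(dim=0), y.mean(dim=0)
    # First-order term: distance between the empirical means.
    cmd = torch.norm(mean_x - mean_y, p=2) / span
    centred_x, centred_y = x - mean_x, y - mean_y
    # Higher-order terms: distances between central moments k = 2..k_max.
    for k in range(2, k_max + 1):
        moment_x = (centred_x ** k).mean(dim=0)
        moment_y = (centred_y ** k).mean(dim=0)
        cmd = cmd + torch.norm(moment_x - moment_y, p=2) / span ** k
    return cmd
```

In a training strategy like the one described, such a term would plausibly be computed pairwise over the projected RGB, depth, and thermal features and added to the recognition loss, pulling the three modalities toward a shared semantic space; the exact pairing and weighting used by the authors are not stated on this page.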

Supported by CloudWalk.

Author information

Correspondence to Yizhe Zhu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, Y., Sun, X., Zhou, X. (2023). Exploiting Multi-modal Fusion for Robust Face Representation Learning with Missing Modality. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14255. Springer, Cham. https://doi.org/10.1007/978-3-031-44210-0_23

  • DOI: https://doi.org/10.1007/978-3-031-44210-0_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44209-4

  • Online ISBN: 978-3-031-44210-0

  • eBook Packages: Computer Science, Computer Science (R0)
