Abstract
Current RGB-D-T face recognition methods alleviate sensitivity to facial variations, pose, occlusion, and illumination by incorporating complementary information, but they rely heavily on the availability of complete modalities. Given the likelihood of missing modalities in real-world scenarios, and the fact that current multi-modal recognition models perform poorly when faced with incomplete data, robust multi-modal models for face recognition that can handle missing modalities are highly desirable. To this end, we propose a multi-modal fusion framework for robustly learning face representations in the presence of missing modalities, using a combination of RGB, depth, and thermal modalities. Our approach fuses these modalities effectively while also narrowing the semantic gap among them. Specifically, we put forward a novel modality-missing loss function to learn modality-specific features that are robust to missing-modality conditions. To project the features of different modalities into the same semantic space, we learn a joint modality-invariant representation with a central moment discrepancy (CMD) based distance constraint as the training strategy. We conduct extensive experiments on several benchmark datasets, such as VAP RGB-D-T and Lock3DFace, and the results demonstrate the effectiveness and robustness of the proposed approach under uncertain missing-modality conditions compared with all baseline algorithms.
Supported by CloudWalk.
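To make the CMD-based distance constraint concrete, below is a minimal PyTorch sketch of a central moment discrepancy penalty in the spirit of Zellinger et al. (2017); it is not the authors' released implementation. The helper name cmd_loss, the tensor shapes, and the choice of K = 5 moments are illustrative assumptions, and the 1/(b-a)^k normalization of the original formulation is omitted (i.e., features are assumed to be scaled to a unit interval).

```python
import torch

def cmd_loss(x: torch.Tensor, y: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Central moment discrepancy between two (batch, dim) feature batches.

    Penalizes differences between the means and the central moments of
    orders 2..k, pulling two modalities toward a shared semantic space.
    """
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)   # first-order term: distance of means
    cx, cy = x - mx, y - my           # centered features
    for order in range(2, k + 1):     # higher-order central moments
        loss = loss + torch.norm(
            cx.pow(order).mean(dim=0) - cy.pow(order).mean(dim=0), p=2
        )
    return loss

# Hypothetical usage: constrain each pair of modality embeddings, e.g.
#   total = id_loss + lam * (cmd_loss(f_rgb, f_depth)
#                            + cmd_loss(f_rgb, f_thermal)
#                            + cmd_loss(f_depth, f_thermal))
```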
References
Bebis, G., Gyaourova, A., Singh, S., Pavlidis, I.: Face recognition by fusing thermal infrared and visible imagery. Image Vision Comput. 24(7), 727–742 (2006)
Cai, L., Wang, Z., Gao, H., Shen, D., Ji, S.: Deep adversarial learning for multi-modality missing data completion. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1158–1166 (2018)
Cui, J., Zhang, H., Han, H., Shan, S., Chen, X.: Improving 2D face recognition via discriminative face depth estimation. In: 2018 International Conference on Biometrics (ICB), pp. 140–147. IEEE (2018)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Du, C., et al.: Semi-supervised deep generative modelling of incomplete multi-modality emotional data. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 108–116 (2018)
Goodfellow, I., et al.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
Guerrero, R., Pham, H.X., Pavlovic, V.: Cross-modal retrieval and synthesis (X-MRS): closing the modality gap in shared subspace learning. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 3192–3201 (2021)
Han, J., Zhang, Z., Ren, Z., Schuller, B.: Implicit fusion by joint audiovisual training for emotion recognition in mono modality. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5861–5865. IEEE (2019)
Hazarika, D., Zimmermann, R., Poria, S.: MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1122–1131 (2020)
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: Proceedings of the 38th International Conference on Machine Learning, ICML. Proceedings of Machine Learning Research, vol. 139, pp. 5583–5594 (2021)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lee, S., Yu, Y., Kim, G., Breuel, T.M., Kautz, J., Song, Y.: Parameter efficient multimodal transformers for video representation learning. In: 9th International Conference on Learning Representations, ICLR (2021)
Liu, A.H., Jin, S., Lai, C.I.J., Rouditchenko, A., Oliva, A., Glass, J.: Cross-modal discrete representation learning. arXiv preprint arXiv:2106.05438 (2021)
Liu, Z., Shen, Y., Lakshminarasimhan, V.B., Liang, P.P., Zadeh, A., Morency, L.P.: Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064 (2018)
Ma, M., Ren, J., Zhao, L., Testuggine, D., Peng, X.: Are multimodal transformers robust to missing modality? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18177–18186 (2022)
Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., Peng, X.: SMIL: multimodal learning with severely missing modality. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2302–2310 (2021)
Mu, G., Huang, D., Hu, G., Sun, J., Wang, Y.: Led3D: a lightweight and efficient deep approach to recognizing low-quality 3D faces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5773–5782 (2019)
Nikisins, O., Nasrollahi, K., Greitans, M., Moeslund, T.B.: RGB-D-T based face recognition. In: 2014 22nd International Conference on Pattern Recognition, pp. 1716–1721. IEEE (2014)
Pham, H., Liang, P.P., Manzini, T., Morency, L.P., Póczos, B.: Found in translation: learning robust joint representations by cyclic translations between modalities. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6892–6899 (2019)
Poria, S., Chaturvedi, I., Cambria, E., Hussain, A.: Convolutional MKL based multimodal emotion recognition and sentiment analysis. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 439–448. IEEE (2016)
Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
Seal, A., Bhattacharjee, D., Nasipuri, M., Gonzalo-Martin, C., Menasalvas, E.: Fusion of visible and thermal images using a directed search method for face recognition. Int. J. Pattern Recogn. Artif. Intell. 31(04), 1756005 (2017)
Shi, Y., Paige, B., Torr, P., et al.: Variational mixture-of-experts autoencoders for multi-modal deep generative models. Adv. Neural Inf. Process. Syst. 32, 1–12 (2019)
Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: transformer for semantic segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp. 7242–7252 (2021)
Tsai, Y.H.H., Liang, P.P., Zadeh, A., Morency, L.P., Salakhutdinov, R.: Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176 (2018)
Uppal, H., Sepas-Moghaddam, A., Greenspan, M., Etemad, A.: Depth as attention for face representation learning. IEEE Trans. Inf. Forensics Secur. 16, 2461–2476 (2021)
Wu, M., Goodman, N.: Multimodal generative models for scalable weakly-supervised learning. Adv. Neural Inf. Process. Syst. 31, 1–11 (2018)
Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)
Zellers, R., et al.: Merlot: multimodal neural script knowledge models. Adv. Neural Inf. Process. Syst. 34, 23634–23651 (2021)
Zellinger, W., Grubinger, T., Lughofer, E., Natschläger, T., Saminger-Platz, S.: Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811 (2017)
Zhang, J., Huang, D., Wang, Y., Sun, J.: Lock3DFace: a large-scale database of low-cost Kinect 3D faces. In: 2016 International Conference on Biometrics (ICB), pp. 1–8. IEEE (2016)
Zhang, Z., et al.: UFC-BERT: unifying multi-modal controls for conditional image synthesis. Adv. Neural Inf. Process. Syst. 34, 27196–27208 (2021)
Zhao, J., Li, R., Jin, Q.: Missing modality imagination network for emotion recognition with uncertain missing modalities. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol. 1: Long Papers, pp. 2608–2618 (2021)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, Y., Sun, X., Zhou, X. (2023). Exploiting Multi-modal Fusion for Robust Face Representation Learning with Missing Modality. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14255. Springer, Cham. https://doi.org/10.1007/978-3-031-44210-0_23
DOI: https://doi.org/10.1007/978-3-031-44210-0_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44209-4
Online ISBN: 978-3-031-44210-0
eBook Packages: Computer Science (R0)